-
Notifications
You must be signed in to change notification settings - Fork 12
Canonical Data Representation for Digital Signature
This document defines a canonical representation of data conforming to a data model that is largely a simplified form of JSON model.
The canonical represenation guarantees that any data with the same content (as defined in this document) reduces to an identical representation.
The purpose of the canonical representation is either streaming comparison of two data items or computation of cryptographic hash of data. So it is very unlikely to store the final canonical representation in its entirety or to decode it back to a data structure, though possible.
- Should be easy to produce from a series of events or in-memory data structure that is an output of a parser or a decoder
- Should not be necessary to implement a parser or a decoder just to produce a canonical representation
- The canonical representation is not to be decoded once created
- So it doesn't need to be easy or efficient to decode
- The size of the representation don't need to be as compact as possible, though must not be unnecessarily huge
- It is not for storage but streaming
- That means no need to employ complex bit fiddling
- The representation must be self-contained, meaning no out-of-band data such as schemas must be required for the interpretation of the representation
In this section, we define what constitutes content of data.
- Boolean
- true or false
- Unit
- A type with a single
()
value
- A type with a single
- Integer
- Arbitrarily large
- Unicode string
- Bytes
- A sequence of unsigned bytes
- Array
- A sequence of values of possibly different types
The data model above is quite restricted compared to many data models implied by data formats like JSON and YAML. So here is how typical data types are mapped to the canonical data model. This is a normative part of this specification.
Many data models have null
or similar. It is mapped to a unit
in the canonical representation.
A map or struct is represented with an array of arrays of length 2
representing key and value. The order of the key-value pairs should be consistent across generation of the canonical representation, resulting in the same order given the same set of key-value pairs.
Only maps with integer or string keys are considered valid, and when generating a canonical representation:
- Integers are sorted in ascending order
- Strings are sorted in lexicographical order
Symbols are written out as strings.
Unions are written out as an array of length 2
. the first item is a discriminator that is in most cases either an integer or a string, and the second is the value.
Once data is prepared against the data model specified above, the canonical encoding of the data is produced as defined in this section.
- The encoding is byte oriented, meaning the order of byte values is significant while the order of bits in a byte isn't
- A byte is non-negative integer, unless stated otherwise
- Fixed size, multibyte items are written out their most signficant byte first (i.e. in big endian)
- A sequence is chunked into multiple segments, each of which is at most
255
byte long excluding one byte length preceeding the content of a segment- A segment is written out as follows:
<length><length bytes>
- A segment is written out as follows:
- The whole sequence is considered terminated when the last segment is less than
255
byte long- If the length of a sequence is multiple of
255
, the last segment is of zero length
- If the length of a sequence is multiple of
- Boolean
-
false
is written as a byte of0
-
true
is written as a byte of1
-
- Unit
- A byte of
2
is written out
- A byte of
- Integer
- Zero
- A byte of
3
is written out
- A byte of
- Positive
- A byte of
4
is written out as an indicator of a positive integer value follows - The number of bytes to be written is written as a single byte
- The integer is written out from the most significant non-zero byte, according to the encoding of a byte sequence above
- A byte of
- Negative
- A byte of
5
is written out as an indicator that a negative integer value follows - The number of bytes to be written is written as a single byte
- The magnitude of the integer is written out from the most significant non-zero byte, according to the encoding of a byte sequence above
- A byte of
- Zero
- Unicode string
- A byte of
6
is written out as an indicator that a string value follows - A UTF-8 encoding of the string is written out, according to the encoding of a byte sequence above
- A byte of
- Bytes
- A byte of
7
is written out as an indicator that a byte sequence follows - The actual bytes are written out, according to the encoding of a byte sequence above
- A byte of
- Array
- A byte of
8
is written out as an indicator that an array value follows - Each value in the array is written out one by one
- After writing out the last item, a byte of
255
is written out to indicate the end of the array (this should be where the type indicator of the next item should have been written out)
- A byte of
- Multi-segments based encoding of strings and bytes are employed to permit easy implementation with a small, bounded buffer, rather than to achieve a compact encoding