smile-specification.md (+12 -1)
@@ -19,6 +19,7 @@ This page covers current data format specification; which is planned to eventual

  ### Update history

+ * 2022-02-20: Clarify handling of "unused" bits (see issue #17) primarily regarding encoding of floating-point numbers, but more generally for all unused bits.
  * 2022-01-26: Import fix to encoding of 7-bit encoded (safe) binary, wrt padding of the last byte.
  * Version 1.0.4 -> 1.0.5
  * 2021-03-18: Minor markup fixes, clarification to "simple literals, numbers" section
@@ -94,6 +95,11 @@ Use of certain byte values is limited:

  Some general notes on tokens:

+ * (2022-02-20) Unused bits in encoded bytes:
+ * SHOULD be encoded as `0` bits by encoder
+ * MUST be ignored by decoders for purposes of decoding itself (MUST NOT affect result of decoding even if `1`)
+ * MAY, however, be verified by decoder but if so MUST NOT fail decoding by default; decoders MAY however report non-compliant `1` bits as warnings
+ * Decoders MAY additionally expose optional "strict" mode in which such non-compliant bit encoding does result in an error and decoding failure
  * Strings are encoded using standard UTF-8 encoding; length is indicated either by using:
  * 6-bit byte length prefix, for lengths 1 - 63 (0 is not used since there is separate token)
  * End-of-String marker byte (0xFC) for variable length Strings.
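
The unused-bit rules added in this hunk can be sketched as a small decoder helper. This is only an illustration under stated assumptions: the function name `read_data_bits`, its `used_bits` parameter, and the `strict` flag are hypothetical and do not come from the specification or from any existing Smile library.

```python
import warnings


def read_data_bits(byte: int, used_bits: int, strict: bool = False) -> int:
    """Return the low `used_bits` bits of `byte` (hypothetical helper).

    Bits above `used_bits` are treated as "unused": they never affect the
    returned value. By default a non-zero unused bit only produces a
    warning; with strict=True it is reported as a decoding error.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in range 0..255")
    if not 1 <= used_bits <= 8:
        raise ValueError("used_bits must be in range 1..8")

    data_mask = (1 << used_bits) - 1
    unused = byte & ~data_mask & 0xFF

    if unused:
        message = f"non-compliant unused bits set: 0x{unused:02X}"
        if strict:
            # Optional "strict" mode: non-compliant bits fail decoding.
            raise ValueError(message)
        # Default mode: report, but do not fail.
        warnings.warn(message)

    # The decoded value ignores unused bits in either mode.
    return byte & data_mask
```

For a byte carrying 4 value bits, `read_data_bits(0x7F, 4)` returns `0x0F` and emits a warning, while the same call with `strict=True` raises `ValueError`.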
@@ -105,16 +111,20 @@ Some general notes on tokens:
  * This means that 2 byte VInt has 13 data bits, for example; and minimum number of bytes to represent a Java long (64 bits) is 10; 9 bytes would give 62 bits (8 * 7 + 6).
  * Signed VInt values are handled using "zigzag" encoding, where sign bit is shifted to be the least-significant bit, and value is shifted left by one (i.e. multiplied by two).
  * Unsigned VInts used as length indicators do NOT use zigzag encoding (since it is only needed to help with encoding of negative values)
+ * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
  * Length indicators are done using VInts (for binary data, unlimited length ("big") integer/decimal values)
  * All length indicators define _actual_ length of data; not possibly encoded length (in case of "safe" encoding, encoded data is longer, and that length can be calculated from payload data length)
+ * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
  * Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian encoding (7 bits used to avoid use of reserved bytes like 0xFF):
  * Data is "right-aligned", meaning padding is prepended to the first byte (and its MSB).
  * For example, the 32-bit float `29.9510` is encoded as `0x26 0x37 0x3E 0x0F 0x04`. We get to this encoding by taking the IEEE 754 32-bit binary representation of the number 29.9510, (1) writing the least-significant 7 bits, (2) right-shifting 7 bits, and repeating the process until the entire bit-string is encoded (5 times for a 32-bit float). As a result, with `bits` denoting that 32-bit representation, 0x26 = bits & 0x7F, 0x37 = (bits >> 7) & 0x7F, 0x3E = (bits >> 14) & 0x7F, 0x0F = (bits >> 21) & 0x7F, and 0x04 = (bits >> 28) & 0x7F.
+ * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
  * "Big" decimal/integer values use "safe" binary encoding
  * "Safe" binary encoding simply uses 7 LSB (sign bit, MSB, is left as 0).
  * The last encoded byte contains 1 - 7 bits: if less than 7, data is "right-aligned", contained in Least-Significant Bits; there will be 0-6 MSB padding bits.
  * For example: when encoding 4 bytes (32 bits), the first full (7-bit) encoded bytes (`0vvvvvvv`) are followed by an incomplete byte containing 4 value bits: `0000vvvv`.
- * NOTE: before version 1.0.5 above statemet claimed incorrect alignment (claiming padding would be for LSB)
+ * NOTE: before version 1.0.5 above statement claimed incorrect alignment (claiming padding would be for LSB)
+ * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
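
The worked float example in this hunk can be reproduced with a short sketch. The function names `float32_to_7bit_groups` and `groups_to_float32` are hypothetical, chosen only for illustration. The sketch returns the 7-bit groups in the order the example lists them (least-significant group first); how that order maps to bytes on the wire is left to the specification text. On decoding, bits beyond the 32 data bits are masked off, in line with the rule that unused bits must not affect the result.

```python
import struct
from typing import List


def float32_to_7bit_groups(value: float) -> List[int]:
    """Split the IEEE 754 32-bit pattern of `value` into five 7-bit groups.

    Groups are returned least-significant first, matching the order in
    which the worked example computes them.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    groups = []
    for _ in range(5):  # 5 * 7 = 35 bits, enough to hold 32
        groups.append(bits & 0x7F)
        bits >>= 7
    return groups


def groups_to_float32(groups: List[int]) -> float:
    """Rebuild a 32-bit float from five 7-bit groups (least-significant first).

    Anything outside the 32 data bits is masked off, so stray `1` bits in
    the unused positions do not change the decoded value.
    """
    bits = 0
    for index, group in enumerate(groups):
        bits |= (group & 0x7F) << (7 * index)
    bits &= 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]


groups = float32_to_7bit_groups(29.9510)
print([f"0x{g:02X}" for g in groups])
```

The most significant group carries only 4 data bits (32 - 28), so its upper 3 bits are the "unused" bits that the added lines say should be left as `0`.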

  ### Tokens: value mode

@@ -161,6 +171,7 @@ Prefix: 0x20; covers byte values 0x20 - 0x3F, although not all values are used
  * 0x2A: `BigDecimal`
  * Encoded as token indicator followed by zigzag encoded scale (32-bit), followed by 7-bit escaped binary (with Unsigned VInt (no-zigzag encoding) as length indicator) that represent magnitude value (byte array) of integral part.
  * 0x2B - reserved for future use
+ * Note that possible "unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`, ignored on decoding.
  * Reserved for future use, avoided (decoding error if found)
  * 0x2C - 0x2F reserved for future use (non-overlapping with keys)
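
Two of the building blocks named in the `BigDecimal` description, the zigzag-encoded scale and the 7-bit "safe" binary payload, can be sketched as follows. The function names are hypothetical. The sketch assumes payload bits are consumed most-significant first, which is consistent with the `0vvvvvvv ... 0000vvvv` example in the specification but is otherwise an assumption here. It does not write the token byte or the VInt length prefix.

```python
def zigzag_encode(n: int) -> int:
    """Move the sign into the least-significant bit: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def safe_binary_encode(data: bytes) -> bytes:
    """Re-pack `data` into bytes that each carry at most 7 value bits.

    Assumption: bits are taken most-significant first. Every full output
    byte has the form 0vvvvvvv; a trailing partial group is right-aligned,
    so its unused high bits are left as 0.
    """
    value = int.from_bytes(data, "big")
    remaining = len(data) * 8
    out = bytearray()
    while remaining >= 7:
        remaining -= 7
        out.append((value >> remaining) & 0x7F)
    if remaining:
        # Last byte: 1-6 value bits in the least-significant positions.
        out.append(value & ((1 << remaining) - 1))
    return bytes(out)


print(zigzag_encode(-2), safe_binary_encode(b"\xff\xff\xff\xff").hex())
```

Encoding 4 bytes (32 bits) yields four full 7-bit bytes followed by one byte holding the remaining 4 bits, which is the `0000vvvv` case described for safe binary encoding.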