Skip to content

Commit b921adc

Browse files
committed
Fix #17: clarify handling, exceptations of encoding of "unused" bits.
1 parent 7ca479f commit b921adc

File tree

1 file changed

+12
-1
lines changed

1 file changed

+12
-1
lines changed

smile-specification.md

+12-1
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ This page covers current data format specification; which is planned to eventual
1919

2020
### Update history
2121

22+
* 2022-02-20: Clarify handling of "unused" bits (see issue #17) primarily regarding encoding of floating-point numbers, but more generally for all unused bits.
2223
* 2022-01-26: Import fix to encoding of 7-bit encoded (safe) binary, wrt padding of the last byte.
2324
* Version 1.0.4 -> 1.0.5
2425
* 2021-03-18: Minor markup fixes, clarification to "simple literals, numbers" section
@@ -94,6 +95,11 @@ Use of certain byte values is limited:
9495

9596
Some general notes on tokens:
9697

98+
* (2022-02-20) Unused bits in encoded bytes:
99+
* SHOULD be encoded as `0` bits by encoder
100+
* MUST be ignored by decoders for purposes of decoding itself (MUST NOT affect result of decoding even if `1`)
101+
* MAY, however, be verified by decoder but if so MUST NOT fail decoding by default; decoders MAY however report non-compliant `1` bits as warnings
102+
* Decoders MAY additionally expose optional "strict" mode in which such non-compliant bit encoding does result in an error and decoding failure
97103
* Strings are encoded using standard UTF-8 encoding; length is indicated either by using:
98104
* 6-bit byte length prefix, for lengths 1 - 63 (0 is not used since there is separate token)
99105
* End-of-String marker byte (0xFC) for variable length Strings.
@@ -105,16 +111,20 @@ Some general notes on tokens:
105111
* This means that 2 byte VInt has 13 data bits, for example; and minimum number of bytes to represent a Java long (64 bits) is 10; 9 bytes would give 62 bits (8 * 7 + 6).
106112
* Signed VInt values are handled using "zigzag" encoding, where sign bit is shifted to be the least-significant bit, and value is shifted left by one (i.e. multiplied by two).
107113
* Unsigned VInts used as length indicators do NOT use zigzag encoding (since it is only needed to help with encoding of negative values)
114+
* "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
108115
* Length indicators are done using VInts (for binary data, unlimited length ("big") integer/decimal values)
109116
* All length indicators define _actual_ length of data; not possibly encoded length (in case of "safe" encoding, encoded data is longer, and that length can be calculated from payload data length)
117+
* "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
110118
* Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian encoding (7 bits used to avoid use of reserved bytes like 0xFF):
111119
* Data is "right-aligned", meaning padding is prepended to the first byte (and its MSB).
112120
* For example, the 32-bit float `29.9510 is` encoded as `0x26 0x37 0x3E 0x0F 0x04.` We get to this encoding by taking the IEEE 764 32-bit binary representation of the number 29.9510, (1) writing the least-significant 7 bits, (2) right-shifting 7 bits, and repeating the process until encoding the entire bit-string (5 times for a 32-bit float). As a result, 0x26 = 29.9510 & 0x7F, 0x37 = (29.9510 >> 7) & 0x7F, 0x3E = (29.9510 >> 14) & 0x7F, 0x0F = (29.9510 >> 21) & 0x7F, and 0x04 = (29.9510 >> 28) & 0x7F.
121+
* "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
113122
* "Big" decimal/integer values use "safe" binary encoding
114123
* "Safe" binary encoding simply uses 7 LSB (sign bit, MSB, is left as 0).
115124
* The last encoded byte contains 1 - 7 bits: if less than 7, data is "right-aligned", contained in Least-Significant Bits; there will be 0-6 MSB padding bits.
116125
* For example: when encoding 4 bytes (32 bits), the first full (7-bit) encoded bytes (`0vvvvvvv`) are followed by an incomplete byte containing 4 value bits: `0000vvvv`.
117-
* NOTE: before version 1.0.5 above statemet claimed incorrect alignment (claiming padding would be for LSB)
126+
* NOTE: before version 1.0.5 above statement claimed incorrect alignment (claiming padding would be for LSB)
127+
* "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`
118128

119129
### Tokens: value mode
120130

@@ -161,6 +171,7 @@ Prefix: 0x20; covers byte values 0x20 - 0x3F, although not all values are used
161171
* 0x2A: `BigDecimal`
162172
* Encoded as token indicator followed by zigzag encoded scale (32-bit), followed by 7-bit escaped binary (with Unsigned VInt (no-zigzag encoding) as length indicator) that represent magnitude value (byte array) of integral part.
163173
* 0x2B - reserved for future use
174+
* Note that possible "unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`, ignored on decoding.
164175
* Reserved for future use, avoided (decoding error if found)
165176
* 0x2C - 0x2F reserved for future use (non-overlapping with keys)
166177
* 0x30 - 0x3F overlapping with key mode and/or header (0x3A)

0 commit comments

Comments
 (0)