> NOTE: These docs were generated by Claude by examining the core codebase and gathering data from queries. They have not been verified by a human.
The Native format is ClickHouse's columnar binary format designed for efficient data exchange. Unlike row-based formats (like RowBinary) where each row is serialized sequentially, Native organizes data by column - all values for the first column are written together, then all values for the second column, and so on.
Columnar Layout (Native):
Column1: [row1_val, row2_val, row3_val, ...]
Column2: [row1_val, row2_val, row3_val, ...]
...
Row-Based Layout (RowBinary):
Row1: [col1_val, col2_val, col3_val, ...]
Row2: [col1_val, col2_val, col3_val, ...]
...
The columnar layout enables efficient compression (similar values are adjacent), better CPU cache utilization, and aligns with ClickHouse's internal storage format.
The term "Native" is used in two related but distinct contexts:
- Native Format: A binary wire format for serializing blocks of columnar data. This document specifies the Native format.
- TCP Native Protocol: ClickHouse's proprietary client-server protocol that uses the Native format for data transfer, plus additional protocol messages for queries, progress, exceptions, etc.
Key Differences:
| Aspect | Native Format (HTTP) | TCP Native Protocol |
|---|---|---|
| Transport | HTTP POST body | TCP socket |
| BlockInfo | Not included | Included (when client_revision > 0) |
| Streaming | Single response | Multi-block streaming with progress |
| Authentication | HTTP headers | Protocol handshake |
| Use Case | Simple integrations | Native clients, high-performance |
When using HTTP with FORMAT Native, you receive pure Native format blocks without the TCP protocol overhead. The TCP protocol wraps Native format blocks with additional metadata like BlockInfo (see Section 3).
| Feature | Native | RowBinary |
|---|---|---|
| Layout | Columnar (all values per column together) | Row-based (all columns per row together) |
| Block Header | NumColumns, NumRows, column names & types | None (schema must be known) |
| Type Info | Embedded in stream | External (defined by INSERT or FORMAT clause) |
| Streaming | Block-at-a-time | Row-at-a-time |
| Memory | Higher (full column buffers) | Lower (single row) |
| Compression | Better (similar values adjacent) | Moderate |
| LowCardinality | Dictionary-encoded | Materialized values |
| Nullable | Null bitmap + values | Per-row null flags + values |
| Complex Types | Stream-based (Array sizes + elements) | Inline per-row |
| Best For | Bulk transfer, server-to-server | Simple clients, streaming |
Size Comparison Example:
For a column of 1000 LowCardinality(String) values with 10 unique strings:
- Native: Dictionary (10 strings) + 1000 index bytes ≈ small
- RowBinary: 1000 full strings ≈ much larger
For simple UInt64 data without repetition, both formats produce similar sizes.
The Native format organizes data into blocks. Each block is a self-contained unit with metadata followed by columnar data. When reading over HTTP (not using the TCP protocol with client_revision > 0), blocks do not include BlockInfo.
A block starts with two VarUInt values that specify its dimensions:
- NumColumns (VarUInt): Number of columns in the block
- NumRows (VarUInt): Number of rows in the block
VarUInt is an unsigned LEB128 (Little Endian Base 128) encoding where each byte uses 7 bits for data and 1 bit as a continuation flag. Values 0-127 encode as a single byte.
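As a sketch, VarUInt encoding and decoding look like this in TypeScript (the function names are illustrative, not from any ClickHouse client library):

```typescript
// Encode an unsigned integer as VarUInt (LEB128): 7 data bits per byte,
// continuation bit (0x80) set on every byte except the last.
function encodeVarUInt(value: number): number[] {
  const out: number[] = [];
  do {
    let byte = value & 0x7f;
    value = Math.floor(value / 128);
    if (value > 0) byte |= 0x80; // more bytes follow
    out.push(byte);
  } while (value > 0);
  return out;
}

// Decode a VarUInt starting at `offset`; returns the value and the next offset.
function decodeVarUInt(buf: Uint8Array, offset: number): { value: number; next: number } {
  let value = 0;
  let shift = 0;
  let pos = offset;
  for (;;) {
    const byte = buf[pos++];
    value += (byte & 0x7f) * 2 ** shift; // avoid 32-bit overflow of <<
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value, next: pos };
}
```

For example, encodeVarUInt(300) produces the two bytes 0xac 0x02 seen in the long-string example later in this document.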
Example: Simple single-column, single-row block
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 42::UInt32 AS num" | xxd

00000000: 0101 036e 756d 0655 496e 7433 322a 0000 ...num.UInt32*..
00000010: 00 .
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1 (VarUInt)
0x01, // NumRows = 1 (VarUInt)
// Column 0:
0x03, 0x6e, 0x75, 0x6d, // Column name = "num" (String: length=3, bytes="num")
0x06, 0x55, 0x49, 0x6e, // Column type = "UInt32" (String: length=6, bytes="UInt32")
0x74, 0x33, 0x32, // (continuation of "UInt32")
0x2a, 0x00, 0x00, 0x00, // Column data: 42 as little-endian UInt32
]);

Example: Two columns, one row
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 'hello'::String AS msg, 100::UInt8 AS id" | xxd

00000000: 0201 036d 7367 0653 7472 696e 6705 6865 ...msg.String.he
00000010: 6c6c 6f02 6964 0555 496e 7438 64 llo.id.UInt8d
Breakdown:
const block = new Uint8Array([
0x02, // NumColumns = 2 (VarUInt)
0x01, // NumRows = 1 (VarUInt)
// Column 0:
0x03, 0x6d, 0x73, 0x67, // Column name = "msg" (String)
0x06, 0x53, 0x74, 0x72, // Column type = "String" (String)
0x69, 0x6e, 0x67, // (continuation)
0x05, 0x68, 0x65, 0x6c, // Column data: "hello" (String: length=5)
0x6c, 0x6f, // (continuation)
// Column 1:
0x02, 0x69, 0x64, // Column name = "id" (String: length=2)
0x05, 0x55, 0x49, 0x6e, // Column type = "UInt8" (String: length=5)
0x74, 0x38, // (continuation)
0x64, // Column data: 100 as UInt8
]);

Example: One column, three rows (columnar layout)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT number::UInt64 AS n FROM numbers(3)" | xxd

00000000: 0103 016e 0655 496e 7436 3400 0000 0000 ...n.UInt64.....
00000010: 0000 0001 0000 0000 0000 0002 0000 0000 ................
00000020: 0000 00 ...
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1 (VarUInt)
0x03, // NumRows = 3 (VarUInt)
// Column 0:
0x01, 0x6e, // Column name = "n" (String: length=1)
0x06, 0x55, 0x49, 0x6e, // Column type = "UInt64" (String: length=6)
0x74, 0x36, 0x34, // (continuation)
// Column data (all 3 rows, columnar):
0x00, 0x00, 0x00, 0x00, // Row 0: 0 as UInt64 (little-endian)
0x00, 0x00, 0x00, 0x00,
0x01, 0x00, 0x00, 0x00, // Row 1: 1 as UInt64 (little-endian)
0x00, 0x00, 0x00, 0x00,
0x02, 0x00, 0x00, 0x00, // Row 2: 2 as UInt64 (little-endian)
0x00, 0x00, 0x00, 0x00,
]);

Each column in the block is serialized sequentially with the following structure:
- Column Name (String): VarUInt length + UTF-8 bytes
- Column Type (String or binary-encoded):
  - By default: String representation (e.g., "UInt32", "Nullable(String)")
  - If output_format_native_encode_types_in_binary_format=1: Binary-encoded type structure
- Serialization Info (conditional, revision >= 54454):
  - Present only when client_revision >= DBMS_MIN_REVISION_WITH_CUSTOM_SERIALIZATION (54454)
  - UInt8 has_custom: 0 or 1, indicating whether custom serialization is used
  - If has_custom == 1: Binary-encoded serialization kind stack (for sparse columns, etc.)
  - See Native Protocol Versions for the revision gates and kind-stack encoding
- Column Data (columnar, type-specific):
  - All values for this column across all rows
  - Serialization depends on the data type (see subsequent sections)
  - If NumRows == 0, no data bytes are written (even for variable-length types)
Empty block behavior:
When a query returns 0 rows, the entire response is empty (0 bytes). The block structure is only written when there is at least one row or when explicitly required by the protocol.
# Query with 0 rows produces empty output
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT number::UInt64 AS n FROM numbers(0)" | wc -c
# Output: 0

BlockInfo contains additional metadata about the block, used primarily for distributed query execution and aggregation.
Note: when HTTP FORMAT Native is requested without a client_protocol_version parameter, blocks start directly with NumColumns / NumRows. However, the server does write BlockInfo for HTTP Native when client_protocol_version > 0 is passed. See Native Protocol Versions.
BlockInfo appears at the very beginning of each block, before the block dimensions, when:
- client_revision > 0
- the writer path is using the revision-aware Native block layout
With legacy HTTP Native (client_revision = 0 / parameter omitted), blocks start directly with NumColumns / NumRows.
BlockInfo uses a field-number-based binary format (similar to protobuf but simpler). The structure is:
[Field1_Num(VarUInt), Field1_Value(binary)]
[Field2_Num(VarUInt), Field2_Value(binary)]
...
[0x00] // Terminator byte
Each field is serialized only if the protocol revision supports it. The serialization always ends with a terminator byte (0x00).
BlockInfo defines three fields (as of protocol revision 54480+):
- Field Number: 1 (VarUInt)
- Type: bool (UInt8: 0 = false, 1 = true)
- Minimum Revision: 0 (always present if client_revision > 0)
- Purpose: Indicates whether this block contains overflow rows from GROUP BY ... WITH TOTALS when max_rows_to_group_by is exceeded with group_by_overflow_mode = 'any'. When true, this block contains aggregated values that didn't fit within the limit.
Encoding:
// If is_overflows = true:
const field1 = new Uint8Array([
0x01, // Field number 1 (VarUInt)
0x01, // is_overflows = true (bool as UInt8)
]);

- Field Number: 2 (VarUInt)
- Type: Int32 (4 bytes, little-endian)
- Minimum Revision: 0 (always present if client_revision > 0)
- Purpose: Used in two-level aggregation to indicate which bucket this block's data belongs to. When using two-level aggregation, data with different key groups are scattered across different buckets. This field helps optimize merging during distributed aggregation.
- Default Value: -1 (indicates no bucket assignment)
Encoding:
// If bucket_num = 5:
const field2 = new Uint8Array([
0x02, // Field number 2 (VarUInt)
0x05, 0x00, 0x00, 0x00, // bucket_num = 5 (Int32, little-endian)
]);

- Field Number: 3 (VarUInt)
- Type: std::vector&lt;Int32&gt; (VarUInt size + array of Int32 values)
- Minimum Revision: 54480 (DBMS_MIN_REVISION_WITH_OUT_OF_ORDER_BUCKETS_IN_AGGREGATION)
- Purpose: List of bucket IDs that were delayed by ConvertingAggregatedToChunksTransform on the current node. Used for coordinating out-of-order bucket processing in distributed aggregation.
- Default Value: Empty vector
Encoding:
// Vector encoding: VarUInt(size) + elements
const writeVector = (values: number[]) => {
const size = values.length; // VarUInt
const elements = values.flatMap(v =>
[v & 0xFF, (v >> 8) & 0xFF, (v >> 16) & 0xFF, (v >> 24) & 0xFF]
);
return [size, ...elements];
};
// If out_of_order_buckets = [10, 20]:
const field3 = new Uint8Array([
0x03, // Field number 3 (VarUInt)
0x02, // Vector size = 2 (VarUInt)
0x0A, 0x00, 0x00, 0x00, // Element 0: 10 (Int32, little-endian)
0x14, 0x00, 0x00, 0x00, // Element 1: 20 (Int32, little-endian)
]);

The BlockInfo serialization always ends with a terminator byte of 0x00 (VarUInt encoding of 0), indicating no more fields follow.
Here's a complete example of BlockInfo with all fields set (hypothetical, as this would only occur in TCP protocol context):
const blockInfo = new Uint8Array([
// Field 1: is_overflows = false
0x01, // Field number 1
0x00, // false
// Field 2: bucket_num = 3
0x02, // Field number 2
0x03, 0x00, 0x00, 0x00, // bucket_num = 3
// Field 3: out_of_order_buckets = [5, 7] (only if revision >= 54480)
0x03, // Field number 3
0x02, // Vector size = 2
0x05, 0x00, 0x00, 0x00, // 5
0x07, 0x00, 0x00, 0x00, // 7
// Terminator
0x00, // End of BlockInfo
// After BlockInfo, the block dimensions and columns follow:
// 0x01, 0x01, ... // NumColumns, NumRows, ...
]);

When reading a Native format stream over the TCP protocol with server_revision > 0, the reader must:
- Read field numbers (VarUInt) in a loop
- For each non-zero field number, read the corresponding field value based on its known type
- Stop when field number 0x00 is encountered
- Then proceed to read the block dimensions (NumColumns, NumRows) and column data
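The loop above can be sketched as follows; the ByteReader class and its helper methods are illustrative, not from any real client library:

```typescript
interface BlockInfo {
  isOverflows: boolean;
  bucketNum: number;
  outOfOrderBuckets: number[];
}

// Minimal cursor over a byte buffer.
class ByteReader {
  constructor(private buf: Uint8Array, private pos = 0) {}
  readUInt8(): number { return this.buf[this.pos++]; }
  readVarUInt(): number {
    let value = 0, shift = 0;
    for (;;) {
      const b = this.readUInt8();
      value += (b & 0x7f) * 2 ** shift;
      if ((b & 0x80) === 0) return value;
      shift += 7;
    }
  }
  readInt32LE(): number {
    const v = this.buf[this.pos] | (this.buf[this.pos + 1] << 8) |
              (this.buf[this.pos + 2] << 16) | (this.buf[this.pos + 3] << 24);
    this.pos += 4;
    return v; // << 24 makes the result a signed 32-bit integer
  }
}

// Read BlockInfo fields until the 0x00 terminator.
function readBlockInfo(r: ByteReader): BlockInfo {
  const info: BlockInfo = { isOverflows: false, bucketNum: -1, outOfOrderBuckets: [] };
  for (;;) {
    const field = r.readVarUInt();
    if (field === 0) break; // terminator
    if (field === 1) info.isOverflows = r.readUInt8() !== 0;
    else if (field === 2) info.bucketNum = r.readInt32LE();
    else if (field === 3) {
      const n = r.readVarUInt();
      for (let i = 0; i < n; i++) info.outOfOrderBuckets.push(r.readInt32LE());
    } else throw new Error(`Unknown BlockInfo field ${field}`);
  }
  return info;
}
```

Feeding it the full example bytes above yields bucketNum = 3 and out_of_order_buckets [5, 7].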
Note: HTTP readers must not assume BlockInfo is absent. It is absent in the legacy default path, but present when HTTP FORMAT Native is requested with a positive client_protocol_version.
The Native format uses a stream-based architecture for complex data types. This is a conceptual model where certain types decompose their binary representation into multiple logical "streams" that are interleaved in the actual binary output. Understanding this concept is essential for correctly decoding complex types over HTTP.
In ClickHouse's serialization architecture, a "stream" is a logical subdivision of a column's binary data. Simple types like UInt32 or String use a single stream (called Regular), where all values are written contiguously. Complex types decompose into multiple named streams, each carrying a specific piece of structural information.
Important: When reading Native format over HTTP, streams are not separate physical byte sequences. They are conceptual divisions that appear sequentially interleaved in the binary output. The serialization process writes each stream's data in a specific order, and you must read them in that exact order.
The stream architecture exists because:
- Columnar Storage Optimization: In ClickHouse's storage engine (MergeTree), these streams map to separate physical files on disk (e.g., column.bin for data, column.size0.bin for array sizes). This allows selective reading and efficient compression.
- Flexible Composition: Complex types can nest arbitrarily (e.g., Array(Array(LowCardinality(String)))). Each type layer adds its own streams to the path hierarchy without knowing about parent or child types.
- Shared Structures: Some streams (like null maps in Nullable or dictionaries in LowCardinality) can be shared across granules or reused, reducing redundancy.
When you read Native format over HTTP:
- Each column's data is written by calling serializeBinaryBulkWithMultipleStreams
- All streams write to the same physical TCP/HTTP output buffer, one after another
- Your decoder must understand the stream order to parse the bytes correctly
Practical Effect: You don't read from multiple simultaneous streams. Instead, you read sequential chunks of binary data, where each chunk corresponds to a logical stream. The order is determined by the type's enumerateStreams method.
Each logical stream is identified by a SubstreamPath: a sequence of Substream components that describe the nesting structure. The path is built up as types compose.
Example SubstreamPath representations:
| Type | SubstreamPath Components | Description |
|---|---|---|
| UInt32 | {Regular} | Single stream for values |
| Array(UInt32) | {ArraySizes} | Array sizes (offsets) |
| | {ArrayElements, Regular} | Element values |
| Nullable(String) | {NullMap} | Null indicator bytes |
| | {NullableElements, Regular} | String values (including for NULLs) |
| Array(Nullable(String)) | {ArraySizes} | Array offsets |
| | {ArrayElements, NullMap} | Null indicators for all elements |
| | {ArrayElements, NullableElements, Regular} | String data |
| LowCardinality(String) | {DictionaryKeys} | Unique dictionary values |
| | {DictionaryIndexes} | Row-to-dictionary index mappings |
For Array(UInt32) with 3 rows: [1,2], [3], []:
Streams:

- ArraySizes (stream path: {ArraySizes}):
  - Contains cumulative offsets (running totals of the array sizes), one per row
  - Written as UInt64 values
  - For sizes 2, 1, 0 the offsets are 2, 3, 3
  - Bytes: 02 00 00 00 00 00 00 00 (offset 2), 03 00 00 00 00 00 00 00 (offset 3), 03 00 00 00 00 00 00 00 (offset 3)
- ArrayElements (stream path: {ArrayElements, Regular}):
  - Contains flattened element values: 1, 2, 3
  - Written as contiguous UInt32 values
  - Bytes: 01 00 00 00 (1), 02 00 00 00 (2), 03 00 00 00 (3)

Binary layout (streams written sequentially):

[ArraySizes: 24 bytes for 3 offsets] [ArrayElements: 12 bytes for 3 UInt32 values]
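A sketch of decoding that layout in TypeScript (the function name is illustrative; it assumes the cumulative offsets fit in ordinary JavaScript numbers):

```typescript
// Decode the body of a Native Array(UInt32) column: N cumulative UInt64
// offsets, then the flattened little-endian UInt32 elements.
function decodeArrayUInt32(buf: Uint8Array, numRows: number): number[][] {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  let pos = 0;

  // 1) Read one cumulative offset per row.
  const offsets: number[] = [];
  for (let i = 0; i < numRows; i++) {
    offsets.push(Number(view.getBigUint64(pos, true))); // true = little-endian
    pos += 8;
  }

  // 2) Slice the flattened element region using consecutive offsets.
  const rows: number[][] = [];
  let start = 0;
  for (const end of offsets) {
    const row: number[] = [];
    for (let j = start; j < end; j++) row.push(view.getUint32(pos + j * 4, true));
    rows.push(row);
    start = end;
  }
  return rows;
}
```

Running it on the bytes above reconstructs the three rows [1,2], [3], [].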
For LowCardinality(String) with 4 rows: "hello", "world", "hello", "world":
Streams:

- DictionaryKeys (stream path: {DictionaryKeys}):
  - Prefix: version marker (UInt64 in prefix stream)
  - Dictionary values: unique strings "hello", "world"
- DictionaryIndexes (stream path: {DictionaryIndexes}):
  - Serialization type (encodes index type: UInt8/16/32/64, flags)
  - Additional keys count (if any overflow keys)
  - Row count (UInt64)
  - Index values: 0, 1, 0, 1 (indexes into dictionary)
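Once the dictionary and the index list have been decoded, materializing the rows is a simple lookup. This sketch deliberately skips the wire-level prefix and flag parsing and only shows the final mapping:

```typescript
// Map per-row dictionary indexes back to their dictionary values.
function materializeLowCardinality<T>(dictionary: T[], indexes: number[]): T[] {
  return indexes.map(i => {
    if (i < 0 || i >= dictionary.length) throw new Error(`index ${i} out of range`);
    return dictionary[i];
  });
}
```

For the example above, materializeLowCardinality(["hello", "world"], [0, 1, 0, 1]) restores the original four rows.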
For nested types, streams compose hierarchically:
Streams (in order):

- {ArraySizes} - array offsets
- {ArrayElements, DictionaryKeys} - dictionary for all elements
- {ArrayElements, DictionaryIndexes} - indexes for all elements
The ArrayElements prefix indicates these streams belong to the array's elements, allowing proper reconstruction of the nested structure.
When deserializing, you must:

- Follow the enumeration order: Types define their stream order via enumerateStreams. This order is deterministic.
- Read each stream completely: Don't interleave reads between different streams of the same column. Read one stream, then the next.
- Respect nesting: For Array(T), first read array sizes, then recursively read the element stream(s) for type T.
- Handle prefixes/suffixes: Some types write metadata in serializeBinaryBulkStatePrefix (before data) and serializeBinaryBulkStateSuffix (after data). In Native format over HTTP, prefixes typically appear inline with the first granule's data.
In a block with multiple columns, the entire column (all its streams) is written before the next column:
Block:
Column 0: [all streams for column 0]
Column 1: [all streams for column 1]
Column 2: [all streams for column 2]
Within a column, streams appear in the order determined by enumerateStreams.
When decoding Native format over HTTP:
- Streams are conceptual, not physical: You read a single byte stream, but must mentally track which logical stream you're in
- Order is critical: Stream order is type-specific and deterministic. Follow the patterns shown above.
- Nested types compose paths: Each nesting level adds components to the stream path (e.g., ArrayElements → NullableElements → Regular)
- Start with simple types: Implement Array and Nullable first to understand the pattern, then tackle LowCardinality, Map, Tuple, etc.
Understanding streams as "logical subdivisions written sequentially" rather than "parallel data channels" is the key mental model for successfully implementing a Native format decoder.
These types have the same columnar encoding as their RowBinary counterparts (values written consecutively for all rows).
Integer types are serialized as fixed-width little-endian values. For a column with N rows, N consecutive integer values are written.
Supported types:
- Int8/UInt8: 1 byte per value
- Int16/UInt16: 2 bytes per value (little-endian)
- Int32/UInt32: 4 bytes per value (little-endian)
- Int64/UInt64: 8 bytes per value (little-endian)
- Int128/UInt128: 16 bytes per value (little-endian)
- Int256/UInt256: 32 bytes per value (little-endian)
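A generic reader for fixed-width little-endian integer columns can be sketched with BigInt, which handles every width up to Int256 uniformly (the helper name is illustrative):

```typescript
// Read N consecutive fixed-width little-endian integers from a column body.
// `byteWidth` is 1/2/4/8/16/32; `signed` applies two's-complement adjustment.
function readIntColumn(buf: Uint8Array, numRows: number, byteWidth: number, signed: boolean): bigint[] {
  const out: bigint[] = [];
  for (let row = 0; row < numRows; row++) {
    // Assemble from the most significant byte down (bytes are little-endian).
    let value = 0n;
    for (let b = byteWidth - 1; b >= 0; b--) {
      value = (value << 8n) | BigInt(buf[row * byteWidth + b]);
    }
    // Two's complement: values with the top bit set are negative.
    if (signed && value >= 1n << BigInt(byteWidth * 8 - 1)) {
      value -= 1n << BigInt(byteWidth * 8);
    }
    out.push(value);
  }
  return out;
}
```

For example, the three UInt8 bytes 00 01 02 from the multi-row example decode to 0, 1, 2, and the Int8 byte 0xff decodes to -1.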
Example: UInt32 single value
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 42::UInt32 AS col" | xxd

00000000: 0101 0363 6f6c 0655 496e 7433 322a 0000 ...col.UInt32*..
00000010: 00 .
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x06, 0x55, 0x49, 0x6e, // Column type = "UInt32"
0x74, 0x33, 0x32,
0x2a, 0x00, 0x00, 0x00, // Data: 42 (little-endian UInt32)
]);

Example: Multiple rows (columnar)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT number::UInt8 AS col FROM numbers(3)" | xxd

00000000: 0103 0363 6f6c 0555 496e 7438 0001 02 ...col.UInt8...
The column data contains three consecutive UInt8 values: 0x00, 0x01, 0x02.
Example: Int8 negative value
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT -1::Int8 AS col" | xxd

00000000: 0101 0363 6f6c 0449 6e74 38ff ...col.Int8.
The value -1 is encoded as 0xff (two's complement representation).
Example: Int128 large value
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 170141183460469231731687303715884105727::Int128 AS col" | xxd

00000000: 0101 0363 6f6c 0649 6e74 3132 38ff ffff ...col.Int128...
00000010: ffff ffff ffff ffff ffff ffff 7f .............
The 16-byte value is little-endian: 0x7fffffffffffffffffffffffffffffff.
Example: UInt256
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 1::UInt256 AS col" | xxd

00000000: 0101 0363 6f6c 0755 496e 7432 3536 0100 ...col.UInt256..
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 00 .............
The 32-byte value is little-endian with the value 1 in the first byte.
Floating point types use IEEE 754 standard encoding in little-endian byte order.
Supported types:
- Float32: 4 bytes, IEEE 754 single-precision
- Float64: 8 bytes, IEEE 754 double-precision
- BFloat16: 2 bytes, brain floating point (16-bit truncation of Float32)
Example: Float32
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 1.25::Float32 AS col" | xxd

00000000: 0101 0363 6f6c 0746 6c6f 6174 3332 0000 ...col.Float32..
00000010: a03f .?
The value 1.25 is encoded as 0x3fa00000 (little-endian IEEE 754).
Example: Float64
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 3.14159265358979::Float64 AS col" | xxd

00000000: 0101 0363 6f6c 0746 6c6f 6174 3634 112d ...col.Float64.-
00000010: 4454 fb21 0940 DT.!.@
The value π is encoded as 0x400921fb54442d11 (little-endian IEEE 754 double).
Example: BFloat16
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 1.25::BFloat16 AS col" | xxd

00000000: 0101 0363 6f6c 0842 466c 6f61 7431 36a0 ...col.BFloat16.
00000010: 3f ?
BFloat16 stores the top 16 bits of the Float32 representation: 0x3fa0 (little-endian). This corresponds to the first 2 bytes of the Float32 encoding 0x3fa00000, reversed to little-endian order.
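Converting the two BFloat16 bytes back to a number follows directly from that description: place the 16 bits in the top half of a Float32. A minimal sketch (the function name is illustrative):

```typescript
// Convert a BFloat16 (two little-endian bytes) to a JavaScript number by
// widening it to a Float32 with a zeroed low half.
function bfloat16ToNumber(lo: number, hi: number): number {
  const buf = new ArrayBuffer(4);
  const view = new DataView(buf);
  // The 16 BFloat16 bits become the top 16 bits of the Float32.
  view.setUint32(0, ((hi << 8) | lo) * 0x10000, false); // big-endian write
  return view.getFloat32(0, false);
}
```

With the bytes from the dump above (a0, 3f), this reconstructs 1.25.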
The Bool type is encoded as a single byte per value:
- 0x00 = false
- 0x01 = true
Example: true
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT true::Bool AS col" | xxd

00000000: 0101 0363 6f6c 0442 6f6f 6c01 ...col.Bool.
Example: false
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT false::Bool AS col" | xxd

00000000: 0101 0363 6f6c 0442 6f6f 6c00 ...col.Bool.
String: Each string value is encoded as a VarUInt length followed by the UTF-8 bytes.
- VarUInt length: Number of bytes in the string (LEB128 encoding)
- Bytes: Raw UTF-8 data
Example: String
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 'hello'::String AS col" | xxd

00000000: 0101 0363 6f6c 0653 7472 696e 6705 6865 ...col.String.he
00000010: 6c6c 6f llo
The string "hello" is encoded as:
- 0x05: Length = 5 bytes (VarUInt)
- 0x68 0x65 0x6c 0x6c 0x6f: "hello" in UTF-8
Example: Empty string
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT ''::String AS col" | xxd

00000000: 0101 0363 6f6c 0653 7472 696e 6700 ...col.String.
Empty string is encoded as 0x00 (VarUInt length = 0).
Example: Multiple strings (columnar)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT arrayJoin(['hello', 'world']) AS col" | xxd

00000000: 0102 0363 6f6c 0653 7472 696e 6705 6865 ...col.String.he
00000010: 6c6c 6f05 776f 726c 64 llo.world
Each string is serialized consecutively:
- Row 0: 0x05 + "hello"
- Row 1: 0x05 + "world"
Example: Long string with multi-byte VarUInt
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT repeat('x', 300) AS col" | xxd | head -5

00000000: 0101 0363 6f6c 0653 7472 696e 67ac 0278 ...col.String..x
00000010: 7878 7878 7878 7878 7878 7878 7878 7878 xxxxxxxxxxxxxxxx
00000020: 7878 7878 7878 7878 7878 7878 7878 7878 xxxxxxxxxxxxxxxx
00000030: 7878 7878 7878 7878 7878 7878 7878 7878 xxxxxxxxxxxxxxxx
00000040: 7878 7878 7878 7878 7878 7878 7878 7878 xxxxxxxxxxxxxxxx
The length is encoded as 0xac 0x02:
- Byte 1: 0xac = 0b10101100 → 7 data bits = 0b0101100, continuation bit set
- Byte 2: 0x02 = 0b00000010 → 7 data bits = 0b0000010, continuation bit not set
- Value = 0b0101100 | (0b0000010 << 7) = 44 | 256 = 300
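Putting the VarUInt decoding together with the per-row layout, a String column reader might look like this sketch (the function name is illustrative):

```typescript
// Read a String column body: each of the N rows is a VarUInt byte length
// followed by that many UTF-8 bytes.
function readStringColumn(buf: Uint8Array, numRows: number): string[] {
  const decoder = new TextDecoder();
  const out: string[] = [];
  let pos = 0;
  for (let row = 0; row < numRows; row++) {
    // Decode the VarUInt length prefix.
    let len = 0, shift = 0;
    for (;;) {
      const b = buf[pos++];
      len += (b & 0x7f) * 2 ** shift;
      if ((b & 0x80) === 0) break;
      shift += 7;
    }
    out.push(decoder.decode(buf.subarray(pos, pos + len)));
    pos += len;
  }
  return out;
}
```

Applied to the two-row example above (05 "hello" 05 "world"), it returns ["hello", "world"].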
FixedString(N): Each value occupies exactly N bytes, zero-padded if the string is shorter.
Example: FixedString(5)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT 'abc'::FixedString(5) AS col" | xxd

00000000: 0101 0363 6f6c 0e46 6978 6564 5374 7269 ...col.FixedStri
00000010: 6e67 2835 2961 6263 0000 ng(5)abc..
The value "abc" is padded to 5 bytes: 0x61 0x62 0x63 0x00 0x00.
Date and time types are encoded as integer values representing specific time units since their respective epochs.
Date: UInt16 representing days since 1970-01-01.
Example: Date
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDate('2023-12-25') AS col" | xxd

00000000: 0101 0363 6f6c 0444 6174 6504 4d ...col.Date.M
The value is 0x4d04 (little-endian UInt16) = 19716 days since 1970-01-01.
Example: Date epoch
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDate('1970-01-01') AS col" | xxd

00000000: 0101 0363 6f6c 0444 6174 6500 00 ...col.Date..
Epoch value is 0x0000 (0 days).
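Converting the stored day count back to a calendar date is a single multiplication; this sketch leans on JavaScript's Date for the calendar math (the function name is illustrative):

```typescript
// Convert a Date value (UInt16 days since 1970-01-01) to an ISO date string.
function decodeDate(days: number): string {
  const ms = days * 86400 * 1000; // days → milliseconds since epoch
  return new Date(ms).toISOString().slice(0, 10);
}
```

decodeDate(19716) reproduces "2023-12-25" from the example above.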
Date32: Int32 representing days since 1970-01-01, with negative values for earlier dates; the supported range extends back to 1900-01-01.
Example: Date32
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDate32('2023-12-25') AS col" | xxd

00000000: 0101 0363 6f6c 0644 6174 6533 3204 4d00 ...col.Date32.M.
00000010: 00 .
The value is 0x00004d04 (little-endian Int32) = 19716 days since 1970-01-01 (the same count Date uses).
Example: Date32 epoch (1900-01-01)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDate32('1900-01-01') AS col" | xxd

00000000: 0101 0363 6f6c 0644 6174 6533 3221 9cff ...col.Date32!..
00000010: ff .
The value is 0xffff9c21 (little-endian Int32) = -25567 days. The epoch is 1970-01-01 (value 0), the same as for Date; dates before 1970 are encoded as negative day counts, and 1900-01-01 is 25567 days before 1970-01-01.
DateTime: UInt32 representing seconds since Unix epoch (1970-01-01 00:00:00 UTC).
Example: DateTime
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDateTime('2023-12-25 10:30:45') AS col" | xxd

00000000: 0101 0363 6f6c 0844 6174 6554 696d 65d5 ...col.DateTime.
00000010: 5989 65 Y.e
The value is 0x658959d5 (little-endian UInt32) = 1703500245 seconds since epoch (2023-12-25 10:30:45 UTC).
Example: DateTime epoch
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDateTime(0) AS col" | xxd

00000000: 0101 0363 6f6c 0844 6174 6554 696d 6500 ...col.DateTime.
00000010: 0000 00 ...
Epoch value is 0x00000000.
DateTime64(P): Int64 representing time units since epoch with precision P (0-9).
- Precision 0: seconds
- Precision 3: milliseconds
- Precision 6: microseconds
- Precision 9: nanoseconds
The type string includes the precision: DateTime64(3).
Example: DateTime64(3) - milliseconds
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDateTime64('2023-12-25 10:30:45.123', 3) AS col" | xxd

00000000: 0101 0363 6f6c 0d44 6174 6554 696d 6536 ...col.DateTime6
00000010: 3428 3329 83e8 86a0 8c01 0000 4(3)........
The value is 0x0000018ca086e883 (little-endian Int64) = 1703500245123 milliseconds since epoch.
Example: DateTime64 epoch
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDateTime64(0, 3) AS col" | xxd

00000000: 0101 0363 6f6c 0d44 6174 6554 696d 6536 ...col.DateTime6
00000010: 3428 3329 0000 0000 0000 0000 4(3)........
Epoch value is 0x0000000000000000.
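A sketch of converting a raw DateTime64 value back to a timestamp, given the precision parsed from the type string; ticks beyond millisecond precision are truncated because JavaScript's Date only resolves milliseconds (the function name is illustrative):

```typescript
// Convert a DateTime64(P) raw value (ticks since epoch) to a JS Date.
function decodeDateTime64(ticks: bigint, precision: number): Date {
  const scale = 10n ** BigInt(precision); // ticks per second
  const seconds = ticks / scale;
  const frac = ticks % scale;             // sub-second ticks
  const ms = seconds * 1000n + (frac * 1000n) / scale;
  return new Date(Number(ms));
}
```

decodeDateTime64(1703500245123n, 3) reproduces the example timestamp 2023-12-25 10:30:45.123 UTC.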
Decimal types store fixed-point numbers as little-endian integers scaled by a power of 10. The scale (number of decimal places) is encoded in the type string.
Supported types:
- Decimal32(S) / Decimal(P, S) where P ≤ 9: 4 bytes (Int32)
- Decimal64(S) / Decimal(P, S) where P ≤ 18: 8 bytes (Int64)
- Decimal128(S) / Decimal(P, S) where P ≤ 38: 16 bytes (Int128)
- Decimal256(S) / Decimal(P, S) where P ≤ 76: 32 bytes (Int256)
The stored value equals the decimal number multiplied by 10^S.
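Rendering a stored value back to decimal text is a matter of inserting the decimal point S digits from the right; BigInt keeps this exact for all widths (the function name is illustrative):

```typescript
// Format a decimal stored value (scaled integer) with scale S as a string.
function formatDecimal(stored: bigint, scale: number): string {
  const sign = stored < 0n ? "-" : "";
  const abs = stored < 0n ? -stored : stored;
  // Pad so there is at least one digit left of the decimal point.
  const digits = abs.toString().padStart(scale + 1, "0");
  const intPart = digits.slice(0, digits.length - scale);
  if (scale === 0) return sign + intPart;
  return `${sign}${intPart}.${digits.slice(digits.length - scale)}`;
}
```

For example, formatDecimal(12345n, 2) yields "123.45", matching the Decimal32(2) example below.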
Example: Decimal32(2)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDecimal32(123.45, 2) AS col" | xxd

00000000: 0101 0363 6f6c 0d44 6563 696d 616c 2839 ...col.Decimal(9
00000010: 2c20 3229 3090 0000 , 2)0...
The type string in the wire format is always Decimal(P, S) notation (e.g., Decimal(9, 2)), even when the SQL uses Decimal32(S). The precision 9 indicates this is a Decimal32 (P ≤ 9 → 4 bytes).
The value 123.45 is stored as 12345 (123.45 × 10^2).
Encoded as 0x00003039 (little-endian Int32, bytes 39 30 00 00) = 12345.
Example: Decimal64(4)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDecimal64(123.45, 4) AS col" | xxd

00000000: 0101 0363 6f6c 0e44 6563 696d 616c 2831 ...col.Decimal(1
00000010: 382c 2034 2944 d612 0000 0000 00 8, 4)D.......
The type string is Decimal(18, 4).
The value 123.45 is stored as 1234500 (123.45 × 10^4).
Encoded as 0x000000000012d644 (little-endian Int64) = 1234500.
Example: Decimal128(6)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDecimal128(123.45, 6) AS col" | xxd

00000000: 0101 0363 6f6c 0e44 6563 696d 616c 2833 ...col.Decimal(3
00000010: 382c 2036 2990 b25b 0700 0000 0000 0000 8, 6)..[........
00000020: 0000 0000 00 .....
The type string is Decimal(38, 6).
The value 123.45 is stored as 123450000 (123.45 × 10^6).
Encoded as 0x000000000000000000000000075bb290 (little-endian Int128) = 123450000.
Example: Decimal256(8)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toDecimal256(123.45, 8) AS col" | xxd

00000000: 0101 0363 6f6c 0e44 6563 696d 616c 2837 ...col.Decimal(7
00000010: 362c 2038 2940 c0d1 df02 0000 0000 0000 6, 8)@..........
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 00 .....
The type string is Decimal(76, 8).
The value 123.45 is stored as 12345000000 (123.45 × 10^8).
Encoded as 32-byte little-endian Int256 = 12345000000.
UUID values are encoded as 16 bytes in a specific byte order that differs from the standard UUID string representation.
ClickHouse stores UUIDs in a format optimized for indexing, where bytes are reordered from the standard UUID string format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
Byte order: Relative to the RFC 4122 byte sequence, each 8-byte half of the UUID is stored reversed (i.e., as two little-endian UInt64 values).
Example: UUID
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toUUID('61f0c404-5cb3-11e7-907b-a6006ad3dba0') AS col" | xxd

00000000: 0101 0363 6f6c 0455 5549 44e7 11b3 5c04 ...col.UUID...\.
00000010: c4f0 61a0 dbd3 6a00 a67b 90 ..a...j..{.
The UUID string 61f0c404-5cb3-11e7-907b-a6006ad3dba0 is encoded as:
e7 11 b3 5c 04 c4 f0 61 a0 db d3 6a 00 a6 7b 90
This corresponds to the RFC 4122 byte sequence 61 f0 c4 04 5c b3 11 e7 90 7b a6 00 6a d3 db a0 with each 8-byte half reversed:

- RFC bytes 0-7 (61 f0 c4 04 5c b3 11 e7), reversed → storage bytes 0-7 (e7 11 b3 5c 04 c4 f0 61)
- RFC bytes 8-15 (90 7b a6 00 6a d3 db a0), reversed → storage bytes 8-15 (a0 db d3 6a 00 a6 7b 90)
The reordering optimizes sorting and range queries on time-based UUIDs (version 1).
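A sketch of converting the 16 stored bytes back to canonical RFC 4122 text: each 8-byte half is read in reverse, then the result is hex-formatted (the function name is illustrative):

```typescript
// Convert 16 ClickHouse-ordered UUID bytes to canonical RFC 4122 text.
function uuidFromClickHouseBytes(b: Uint8Array): string {
  const rfc = new Uint8Array(16);
  for (let i = 0; i < 8; i++) rfc[i] = b[7 - i];       // first half, reversed
  for (let i = 0; i < 8; i++) rfc[8 + i] = b[15 - i];  // second half, reversed
  const hex = Array.from(rfc, x => x.toString(16).padStart(2, "0")).join("");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}
```

Feeding it the 16 data bytes from the dump above reconstructs 61f0c404-5cb3-11e7-907b-a6006ad3dba0.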
IPv4: Stored as 4 bytes (UInt32) in reversed byte order from the standard dotted-decimal notation.
Example: IPv4
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toIPv4('192.168.1.1') AS col" | xxd

00000000: 0101 0363 6f6c 0449 5076 3401 01a8 c0 ...col.IPv4....
The address 192.168.1.1 is encoded as 0x0101a8c0:
- Byte 0: 0x01 (last octet: 1)
- Byte 1: 0x01 (third octet: 1)
- Byte 2: 0xa8 (second octet: 168)
- Byte 3: 0xc0 (first octet: 192)

This is the little-endian UInt32 encoding of the address interpreted as a 32-bit number (192.168.1.1 = 0xc0a80101).
Example: IPv4 localhost
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toIPv4('127.0.0.1') AS col" | xxd

00000000: 0101 0363 6f6c 0449 5076 3401 0000 7f ...col.IPv4....
The address 127.0.0.1 is encoded as 0x0100007f (reversed bytes).
Example: Multiple IPv4 values
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toIPv4('192.168.' || toString(number) || '.1') AS col FROM numbers(3)" | xxd

00000000: 0103 0363 6f6c 0449 5076 3401 00a8 c001 ...col.IPv4.....
00000010: 01a8 c001 02a8 c0 .......
Three consecutive 4-byte values for 192.168.0.1, 192.168.1.1, 192.168.2.1.
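A minimal encoding sketch (the helper name is ours): the four octets are simply written in reverse, which is the little-endian layout of the address as a UInt32:

```javascript
// Encode a dotted-decimal IPv4 address as its 4 Native bytes
// (the address as a UInt32, stored little-endian).
function ipv4ToNative(ip) {
  const octets = ip.split(".").map(Number); // e.g. [192, 168, 1, 1]
  return new Uint8Array(octets.reverse()); // last octet first
}
```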
IPv6: Stored as 16 bytes in network byte order (big-endian).
Example: IPv6
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toIPv6('2001:db8::1') AS col" | xxd
00000000: 0101 0363 6f6c 0449 5076 3620 010d b800  ...col.IPv6 ....
00000010: 0000 0000 0000 0000 0000 01              ...........
The address 2001:db8::1 is encoded as 16 bytes in network byte order:
20 01 0d b8 00 00 00 00 00 00 00 00 00 00 00 01
Example: IPv6 localhost
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toIPv6('::1') AS col" | xxd
00000000: 0101 0363 6f6c 0449 5076 3600 0000 0000  ...col.IPv6.....
00000010: 0000 0000 0000 0000 0000 01              ...........
The IPv6 loopback address ::1 is 15 zero bytes followed by 0x01.
Enum types are encoded as integer values representing the enum value's assigned number. The enum definition is included in the type string.
Enum8: Stored as 1 byte (Int8) per value.
Example: Enum8
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT CAST('a', 'Enum8(''a''=1, ''b''=2)') AS col" | xxd
00000000: 0101 0363 6f6c 1745 6e75 6d38 2827 6127  ...col.Enum8('a'
00000010: 203d 2031 2c20 2762 2720 3d20 3229 01    = 1, 'b' = 2).
Breakdown:
- Type string: Enum8('a' = 1, 'b' = 2) (23 bytes)
- Data: 0x01 (the assigned value for 'a')
The enum value 'a' is encoded as its assigned integer 1.
Enum16: Stored as 2 bytes (Int16, little-endian) per value.
Example: Enum16
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT CAST('hello', 'Enum16(''hello''=1000, ''world''=2000)') AS col" | xxd
00000000: 0101 0363 6f6c 2645 6e75 6d31 3628 2768  ...col&Enum16('h
00000010: 656c 6c6f 2720 3d20 3130 3030 2c20 2777  ello' = 1000, 'w
00000020: 6f72 6c64 2720 3d20 3230 3030 29e8 03    orld' = 2000)..
Breakdown:
- Type string: Enum16('hello' = 1000, 'world' = 2000) (38 bytes)
- Data: 0x03e8 (little-endian Int16, wire bytes e8 03) = 1000
The enum value 'hello' is encoded as its assigned integer 1000 in little-endian format.
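Because the enum definition travels in the type string, a reader must parse it to map decoded integers back to names. A simplified sketch (the helper name is ours; it does not handle escaped quotes inside enum names):

```javascript
// Extract value → name pairs from an Enum8/Enum16 type string.
function parseEnumDef(typeStr) {
  const entries = new Map();
  for (const m of typeStr.matchAll(/'([^']*)'\s*=\s*(-?\d+)/g)) {
    entries.set(Number(m[2]), m[1]); // assigned number → enum name
  }
  return entries;
}
```

For example, parseEnumDef("Enum16('hello' = 1000, 'world' = 2000)").get(1000) yields 'hello'.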
Nullable(T) wraps any type T to allow NULL values. It uses a two-stream architecture:
- NullMap Stream: Null indicator bytes (1 byte per row)
- NullableElements Stream: The nested type's data (all rows, including NULLs)
Stream Layout:
The serialization writes two streams sequentially:
[NullMap: N bytes] [NestedData: serialized as type T]
Null Map Encoding:
For N rows, N consecutive bytes are written:
- 0x00 = non-NULL (value is valid)
- 0x01 = NULL (value should be ignored)
Important: Even for NULL rows, the nested type still writes placeholder data. When deserializing, if the null map indicates NULL, the corresponding value in the nested data should be ignored and replaced with NULL.
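The two-stream layout can be decoded as sketched below for Nullable(UInt64) (illustrative only; the helper name is ours):

```javascript
// Read a Nullable(UInt64) column body: a null map of numRows bytes,
// followed by numRows little-endian UInt64 values (placeholders
// are present even for NULL rows and must be discarded).
function readNullableUInt64(view, offset, numRows) {
  const nullMap = [];
  for (let i = 0; i < numRows; i++) nullMap.push(view.getUint8(offset + i));
  offset += numRows;
  const values = [];
  for (let i = 0; i < numRows; i++) {
    const v = view.getBigUint64(offset, true); // little-endian
    offset += 8;
    values.push(nullMap[i] === 1 ? null : v); // drop placeholder when NULL
  }
  return { values, offset };
}
```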
Example: Nullable(UInt64) with mixed NULL values
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT if(number % 2 = 0, number, NULL)::Nullable(UInt64) AS col FROM numbers(5)" | xxd
00000000: 0105 0363 6f6c 104e 756c 6c61 626c 6528  ...col.Nullable(
00000010: 5549 6e74 3634 2900 0100 0100 0000 0000  UInt64).........
00000020: 0000 0000 0100 0000 0000 0000 0200 0000  ................
00000030: 0000 0000 0300 0000 0000 0000 0400 0000  ................
00000040: 0000 0000                                ....
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x05, // NumRows = 5
// Column metadata
0x03, 0x63, 0x6f, 0x6c, // Column name = "col" (String: length=3)
0x10, 0x4e, 0x75, 0x6c, // Column type = "Nullable(UInt64)" (String: length=16)
0x6c, 0x61, 0x62, 0x6c,
0x65, 0x28, 0x55, 0x49,
0x6e, 0x74, 0x36, 0x34,
0x29,
// Stream 1: NullMap (5 bytes for 5 rows)
0x00, // Row 0: not NULL (value = 0)
0x01, // Row 1: NULL
0x00, // Row 2: not NULL (value = 2)
0x01, // Row 3: NULL
0x00, // Row 4: not NULL (value = 4)
// Stream 2: NullableElements (5 x UInt64 values)
// Note: All 5 values are written, even for NULL rows
0x00, 0x00, 0x00, 0x00, // Row 0: 0 (valid, NullMap[0]=0x00)
0x00, 0x00, 0x00, 0x00,
0x01, 0x00, 0x00, 0x00, // Row 1: 1 (IGNORED, NullMap[1]=0x01)
0x00, 0x00, 0x00, 0x00,
0x02, 0x00, 0x00, 0x00, // Row 2: 2 (valid, NullMap[2]=0x00)
0x00, 0x00, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, // Row 3: 3 (IGNORED, NullMap[3]=0x01)
0x00, 0x00, 0x00, 0x00,
0x04, 0x00, 0x00, 0x00, // Row 4: 4 (valid, NullMap[4]=0x00)
0x00, 0x00, 0x00, 0x00,
]);
// Resulting column values: [0, NULL, 2, NULL, 4]
Example: All NULL values
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT NULL::Nullable(UInt8) AS col FROM numbers(3)" | xxd
00000000: 0103 0363 6f6c 0f4e 756c 6c61 626c 6528  ...col.Nullable(
00000010: 5549 6e74 3829 0101 0100 0000            UInt8)......
Breakdown:
- NullMap: 0x01 0x01 0x01 (all three rows are NULL)
- Data: 0x00 0x00 0x00 (placeholder UInt8 values, all ignored)
Example: Nullable(String)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT if(number = 1, NULL, toString(number))::Nullable(String) AS col FROM numbers(3)" | xxd
00000000: 0103 0363 6f6c 0f4e 756c 6c61 626c 6528  ...col.Nullable(
00000010: 5374 7269 6e67 2900 0100 0130 0030 0132  String)....0.0.2
Breakdown:
- NullMap: 0x00 0x01 0x00 (rows 0 and 2 are valid, row 1 is NULL)
- Data:
  - Row 0: 0x01 0x30 (String "0", valid)
  - Row 1: 0x00 (empty string placeholder, IGNORED)
  - Row 2: 0x01 0x32 (String "2", valid)
Nested Nullable Types:
For complex types like Array(Nullable(T)), the stream paths compose hierarchically:
- {ArraySizes} - array offsets
- {ArrayElements, NullMap} - null indicators for all array elements
- {ArrayElements, NullableElements, Regular} - element values
LowCardinality(T) is a dictionary-encoded column type that stores unique values once in a dictionary and uses integer indexes to reference them. This dramatically reduces storage and memory usage for columns with low cardinality (few distinct values).
Stream Architecture:
LowCardinality uses two main streams that are written sequentially:
- DictionaryKeys Stream: Dictionary metadata (version)
- DictionaryIndexes Stream: Index metadata, dictionary values (as "additional keys"), and row indexes
Complete Binary Structure:
For Native format over HTTP, the typical structure is:
// DictionaryKeys Stream:
[KeysVersion (UInt64, little-endian)]
// DictionaryIndexes Stream:
[IndexesSerializationType (UInt64, little-endian)]
[NumAdditionalKeys (UInt64, little-endian)] (if HasAdditionalKeysBit flag is set)
[AdditionalKeys data] (if HasAdditionalKeysBit flag is set)
[NumRows (UInt64, little-endian)]
[Index values: N integers]
DictionaryKeys Stream:
- KeysVersion (UInt64, little-endian): Serialization version
- Always 0x01 = SharedDictionariesWithAdditionalKeys
In Native format over HTTP, the dictionary is typically sent as "additional keys" in the DictionaryIndexes stream rather than as a global dictionary here, because low_cardinality_use_single_dictionary_for_part is enabled by default.
DictionaryIndexes Stream:
1. IndexesSerializationType (UInt64, little-endian):
Encodes the index integer type and flags:
Bits 0-7 (Index Type after flag masking):
- 0x00 = UInt8 indexes
- 0x01 = UInt16 indexes
- 0x02 = UInt32 indexes
- 0x03 = UInt64 indexes
Flag Bits:
- Bit 8 (0x0100): NeedGlobalDictionaryBit - read the dictionary from a separate global dictionary (0 for Native HTTP)
- Bit 9 (0x0200): HasAdditionalKeysBit - additional keys follow
- Bit 10 (0x0400): NeedUpdateDictionary - the dictionary changed (typically 1 for Native HTTP)
Decoding:
The IndexesSerializationType is a UInt64 where:
- Bits 0-7 contain the index type (0-3)
- Bits 8-10 contain the flags
uint64_t value = read_uint64_le();
uint8_t index_type = value & 0xFF;               // Extract bits 0-7 (type is 0-3)
bool need_global_dictionary = (value >> 8) & 1;  // Bit 8
bool has_additional_keys = (value >> 9) & 1;     // Bit 9
bool need_update_dictionary = (value >> 10) & 1; // Bit 10
Note: The index type uses only bits 0-1 (values 0-3), but the full byte is reserved.
2. Additional Keys (conditional):
If HasAdditionalKeysBit is set:
- NumAdditionalKeys (UInt64, little-endian): Number of dictionary entries
- Keys Data: Serialized using inner type's bulk serialization
  - For String: sequence of (VarUInt length + UTF-8 bytes)
  - For UInt32: sequence of UInt32 values
  - etc.
3. Row Indexes:
- NumRows (UInt64, little-endian): Number of rows
- Index Values: NumRows consecutive integers (type from IndexesSerializationType)
- Each index references a position in the dictionary
- Index semantics depend on flags:
  - If !need_global_dictionary: index directly into additional_keys
  - If need_global_dictionary: index < dict_size → global dict; index >= dict_size → additional_keys[index - dict_size]
Example: LowCardinality(String) with repeated values
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toLowCardinality(toString(number % 3)) AS col FROM numbers(6)" | xxd
00000000: 0106 0363 6f6c 164c 6f77 4361 7264 696e  ...col.LowCardin
00000010: 616c 6974 7928 5374 7269 6e67 2901 0000  ality(String)...
00000020: 0000 0000 0000 0600 0000 0000 0004 0000  ................
00000030: 0000 0000 0000 0130 0131 0132 0600 0000  .......0.1.2....
00000040: 0000 0000 0102 0301 0203                 ..........
Breakdown:
const block = new Uint8Array([
// Block header
0x01, // NumColumns = 1
0x06, // NumRows = 6
// Column metadata
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x16, 0x4c, 0x6f, 0x77, 0x43, 0x61, 0x72, 0x64, 0x69, // Type = "LowCardinality(String)"
0x6e, 0x61, 0x6c, 0x69, 0x74, 0x79, 0x28, 0x53,
0x74, 0x72, 0x69, 0x6e, 0x67, 0x29,
// DictionaryKeys stream:
0x01, 0x00, 0x00, 0x00, // KeysVersion = 1 (UInt64)
0x00, 0x00, 0x00, 0x00,
// DictionaryIndexes stream:
0x00, 0x06, 0x00, 0x00, // IndexesSerializationType = 0x0600 (UInt64)
0x00, 0x00, 0x00, 0x00, // = HasAdditionalKeysBit (0x0200) | NeedUpdateDictionary (0x0400)
                        //   with index type UInt8 (0x00)
0x04, 0x00, 0x00, 0x00, // NumAdditionalKeys = 4 (UInt64)
0x00, 0x00, 0x00, 0x00, // (4 keys for 3 distinct values: index 0 holds a
                        //  reserved default entry, here the empty string)
// Additional keys (String values):
0x00, // Key[0]: empty string (length=0)
0x01, 0x30, // Key[1]: "0" (length=1, data=0x30)
0x01, 0x31, // Key[2]: "1" (length=1, data=0x31)
0x01, 0x32, // Key[3]: "2" (length=1, data=0x32)
0x06, 0x00, 0x00, 0x00, // NumRows = 6 (UInt64)
0x00, 0x00, 0x00, 0x00,
// Row indexes (6 values):
0x01, // Row 0: index 1 → "0" (number % 3 = 0)
0x02, // Row 1: index 2 → "1" (number % 3 = 1)
0x03, // Row 2: index 3 → "2" (number % 3 = 2)
0x01, // Row 3: index 1 → "0" (number % 3 = 0)
0x02, // Row 4: index 2 → "1" (number % 3 = 1)
0x03, // Row 5: index 3 → "2" (number % 3 = 2)
]);
// Result: column values = ["0", "1", "2", "0", "1", "2"]
Key Observations:
1. Index 0 Convention: The dictionary includes a reserved 0-index entry (here, an empty string) that serves as a default or NULL placeholder. Actual data indexes start at 1.
2. Dictionary Compression: Only 3 distinct values ("0", "1", "2") are stored, even though there are 6 rows. The indexes array (6 bytes) plus the dictionary keys (6 bytes total) is much smaller than 6 separate string serializations would be.
3. Index Type Selection: The index type (UInt8/16/32/64) is chosen based on the dictionary size to minimize space.
Example: Simple LowCardinality(String) without repetition
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT toLowCardinality(toString(number)) AS col FROM numbers(3)" | xxd00000000: 0103 0363 6f6c 164c 6f77 4361 7264 696e ...col.LowCardin
00000010: 616c 6974 7928 5374 7269 6e67 2901 0000 ality(String)...
00000020: 0000 0000 0000 0600 0000 0000 0004 0000 ................
00000030: 0000 0000 0000 0130 0131 0132 0300 0000 .......0.1.2....
00000040: 0000 0000 0102 03 ....
Breakdown:
const block = new Uint8Array([
0x01, 0x03, // 1 column, 3 rows
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x16, /* "LowCardinality(String)" */,
// DictionaryKeys:
0x01, 0x00, 0x00, 0x00, // KeysVersion = 1
0x00, 0x00, 0x00, 0x00,
// DictionaryIndexes:
0x00, 0x06, 0x00, 0x00, // IndexesSerializationType = 0x0600
0x00, 0x00, 0x00, 0x00, // (HasAdditionalKeysBit | NeedUpdateDictionary, UInt8 indexes)
0x04, 0x00, 0x00, 0x00, // NumAdditionalKeys = 4
0x00, 0x00, 0x00, 0x00,
0x00, // Key[0]: "" (empty/default)
0x01, 0x30, // Key[1]: "0"
0x01, 0x31, // Key[2]: "1"
0x01, 0x32, // Key[3]: "2"
0x03, 0x00, 0x00, 0x00, // NumRows = 3
0x00, 0x00, 0x00, 0x00,
0x01, 0x02, 0x03, // Indexes: [1, 2, 3] → ["0", "1", "2"]
]);
Nested LowCardinality:
For complex types like Array(LowCardinality(String)), the stream paths compose:
- {ArraySizes} - array offsets
- {ArrayElements, DictionaryKeys} - dictionary for all array elements
- {ArrayElements, DictionaryIndexes} - indexes for all array elements
Additional Keys vs. Global Dictionary:
- Global Dictionary Mode (need_global_dictionary = true): Dictionary is shared across granules, written once in the DictionaryKeys stream or at the end in a suffix
- Additional Keys Mode (has_additional_keys = true): Dictionary sent inline with the indexes stream; common in Native HTTP format
For most queries, Native HTTP format uses additional keys mode, as each HTTP response is self-contained.
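The flag layout of IndexesSerializationType can be unpacked with a small helper (a sketch; the field names are ours):

```javascript
// Unpack the IndexesSerializationType UInt64 into its index type and flags.
function decodeIndexesType(value) {
  const v = Number(value);
  const types = ["UInt8", "UInt16", "UInt32", "UInt64"];
  return {
    indexType: types[v & 0xff],                  // bits 0-7 (only values 0-3 used)
    needGlobalDictionary: ((v >> 8) & 1) === 1,  // bit 8
    hasAdditionalKeys: ((v >> 9) & 1) === 1,     // bit 9
    needUpdateDictionary: ((v >> 10) & 1) === 1, // bit 10
  };
}
```

For example, the value 0x0600 decodes to UInt8 indexes with HasAdditionalKeysBit and NeedUpdateDictionary set - the typical combination for Native HTTP responses.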
The Array type uses a two-stream architecture to serialize variable-length arrays efficiently in columnar format:
- ArraySizes stream: One UInt64 per row (written as cumulative element offsets in the Native format; see below)
- ArrayElements stream: Contains all elements from all arrays, flattened consecutively
This design allows the array structure (sizes) to be separated from the element data, enabling efficient compression and selective reading in storage engines.
Stream Enumeration Order:
For Array(T), the streams appear in this order:
1. Stream path {ArraySizes}: Array sizes (one UInt64 per row)
2. Stream path {ArrayElements, ...}: Element data (recursively serialized according to type T)
Serialization Details:
The array sizes are written as cumulative offsets in the Native format. Each offset is a UInt64 value in little-endian format that represents the total number of elements from all arrays up to and including the current row.
The element data follows immediately after all array sizes, with all elements from all arrays written consecutively. For example, if you have 3 arrays [1,2], [3], [], the offsets would be 2, 3, 3 and the element stream would contain just 1, 2, 3 (3 elements total).
Example: Simple Array(UInt32)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [1, 2, 3]::Array(UInt32) AS col" | xxd00000000: 0101 0363 6f6c 0d41 7272 6179 2855 496e ...col.Array(UIn
00000010: 7433 3229 0300 0000 0000 0000 0100 0000 t32)............
00000020: 0200 0000 0300 0000 ........
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1 (VarUInt)
0x01, // NumRows = 1 (VarUInt)
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col" (String: length=3)
0x0d, 0x41, 0x72, 0x72, // Column type = "Array(UInt32)" (String: length=13)
0x61, 0x79, 0x28, 0x55,
0x49, 0x6e, 0x74, 0x33,
0x32, 0x29,
// ArraySizes stream (1 array with 3 elements):
0x03, 0x00, 0x00, 0x00, // Array offset: 3 (UInt64, little-endian)
0x00, 0x00, 0x00, 0x00,
// ArrayElements stream (3 UInt32 values):
0x01, 0x00, 0x00, 0x00, // Element 0: 1 (UInt32)
0x02, 0x00, 0x00, 0x00, // Element 1: 2 (UInt32)
0x03, 0x00, 0x00, 0x00, // Element 2: 3 (UInt32)
]);
Example: Multiple arrays with varying sizes
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT arrayJoin([[1,2], [3], []]) AS col" | xxd
00000000: 0103 0363 6f6c 0c41 7272 6179 2855 496e  ...col.Array(UIn
00000010: 7438 2902 0000 0000 0000 0003 0000 0000  t8).............
00000020: 0000 0003 0000 0000 0000 0001 0203       ..............
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x03, // NumRows = 3
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x0c, 0x41, 0x72, 0x72, // Column type = "Array(UInt8)" (length=12)
0x61, 0x79, 0x28, 0x55,
0x49, 0x6e, 0x74, 0x38,
0x29,
// ArraySizes stream (3 arrays):
0x02, 0x00, 0x00, 0x00, // Row 0: offset = 2 (cumulative: 2 elements)
0x00, 0x00, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, // Row 1: offset = 3 (cumulative: 3 elements total)
0x00, 0x00, 0x00, 0x00,
0x03, 0x00, 0x00, 0x00, // Row 2: offset = 3 (cumulative: 3 elements total)
0x00, 0x00, 0x00, 0x00,
// ArrayElements stream (3 total elements: 1, 2, 3):
0x01, // Element from row 0: 1 (UInt8)
0x02, // Element from row 0: 2 (UInt8)
0x03, // Element from row 1: 3 (UInt8)
// Row 2 has no elements (empty array)
]);
Understanding Cumulative Offsets:
In the Native format, the ArraySizes stream contains cumulative offsets, not individual sizes. The offset for row i indicates the total number of elements from all arrays in rows 0 through i.
In the example above:
- Row 0: size = 2 - 0 = 2 elements (array [1, 2])
- Row 1: size = 3 - 2 = 1 element (array [3])
- Row 2: size = 3 - 3 = 0 elements (array [])
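The offset arithmetic above generalizes to a small decoding helper (a sketch; the name is ours):

```javascript
// Recover per-row arrays from cumulative offsets plus the flattened elements.
function splitArrays(offsets, elements) {
  const rows = [];
  let prev = 0;
  for (const off of offsets) {
    rows.push(elements.slice(prev, off)); // size of row i = offsets[i] - offsets[i-1]
    prev = off;
  }
  return rows;
}
```

splitArrays([2, 3, 3], [1, 2, 3]) yields [[1, 2], [3], []], matching the example.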
Nested Arrays:
For nested arrays like Array(Array(T)), the structure repeats hierarchically:
- Outer ArraySizes stream: cumulative offsets for the outer arrays
- Inner ArraySizes stream: cumulative offsets for all inner arrays (flattened)
- ArrayElements stream: all leaf elements (flattened)
The stream paths would be:
- {ArraySizes} - outer array offsets
- {ArrayElements, ArraySizes} - inner array offsets
- {ArrayElements, ArrayElements, Regular} - element data
The Map(K, V) type is internally represented as Array(Tuple(K, V)) where each map entry is a key-value pair stored as a tuple. This means Map inherits the same stream structure as Array, but with nested streams for the tuple elements.
Stream Enumeration Order:
For Map(K, V), the streams appear in this order:
1. Stream path {ArraySizes}: Map sizes (number of key-value pairs per row)
2. Stream path {ArrayElements, TupleElement0, ...}: All keys, serialized according to type K
3. Stream path {ArrayElements, TupleElement1, ...}: All values, serialized according to type V
Serialization Process:
1. The number of key-value pairs in each map is written to the ArraySizes stream (as cumulative offsets)
2. All keys from all maps are flattened and written consecutively
3. All values from all maps are flattened and written consecutively
Example: Map(String, UInt32)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT map('a', 1, 'b', 2)::Map(String, UInt32) AS col" | xxd
00000000: 0101 0363 6f6c 134d 6170 2853 7472 696e  ...col.Map(Strin
00000010: 672c 2055 496e 7433 3229 0200 0000 0000  g, UInt32)......
00000020: 0000 0161 0162 0100 0000 0200 0000       ...a.b........
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x13, 0x4d, 0x61, 0x70, // Column type = "Map(String, UInt32)" (length=19)
0x28, 0x53, 0x74, 0x72,
0x69, 0x6e, 0x67, 0x2c,
0x20, 0x55, 0x49, 0x6e,
0x74, 0x33, 0x32, 0x29,
// ArraySizes stream (1 map with 2 entries):
0x02, 0x00, 0x00, 0x00, // Map offset: 2 key-value pairs (UInt64)
0x00, 0x00, 0x00, 0x00,
// TupleElement0 stream (Keys - 2 String values):
0x01, 0x61, // Key 0: "a" (String: length=1)
0x01, 0x62, // Key 1: "b" (String: length=1)
// TupleElement1 stream (Values - 2 UInt32 values):
0x01, 0x00, 0x00, 0x00, // Value 0: 1 (UInt32)
0x02, 0x00, 0x00, 0x00, // Value 1: 2 (UInt32)
]);
Stream Path Hierarchy:
Map(String, UInt32)
└── ArraySizes [2] (one entry with offset 2)
└── ArrayElements (Tuple)
├── TupleElement0 (Keys) ["a", "b"]
└── TupleElement1 (Values) [1, 2]
Empty Maps:
An empty map contributes no keys or values to the element streams. Only the offset in the ArraySizes stream reflects its presence (the offset doesn't increase from the previous row).
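Building on the Array decoding, map rows can be reassembled from the offsets and the flattened key and value columns (a sketch; the name is ours):

```javascript
// Rebuild Map rows from the Array(Tuple(K, V)) streams: cumulative
// offsets plus the flattened key and value columns.
function splitMaps(offsets, keys, values) {
  const rows = [];
  let prev = 0;
  for (const off of offsets) {
    const m = new Map();
    for (let i = prev; i < off; i++) m.set(keys[i], values[i]);
    rows.push(m); // empty maps contribute no keys/values, offset unchanged
    prev = off;
  }
  return rows;
}
```

splitMaps([2], ["a", "b"], [1, 2]) yields one Map {"a" → 1, "b" → 2}, matching the example.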
The Tuple type serializes each element sequentially in separate streams. Unlike Array or Map, there is no size metadata because tuples have a fixed number of elements known at schema definition time.
Stream Enumeration Order:
For Tuple(T1, T2, ..., Tn), each element is serialized in its own stream path:
1. Stream path {TupleElement0, ...}: First element, serialized according to type T1
2. Stream path {TupleElement1, ...}: Second element, serialized according to type T2
...
n. Stream path {TupleElement(n-1), ...}: Nth element, serialized according to type Tn
Serialization Details:
Each tuple element is serialized as if it were an independent column. For a block with N rows, each element stream contains N values of its respective type. The element streams are written consecutively in the order of the tuple elements.
There is no explicit tuple structure metadata in the wire format - the tuple structure is defined by the type string, and each element is simply serialized in order.
Example: Tuple(UInt32, String)
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT (42, 'hello')::Tuple(UInt32, String) AS col" | xxd
00000000: 0101 0363 6f6c 1554 7570 6c65 2855 496e  ...col.Tuple(UIn
00000010: 7433 322c 2053 7472 696e 6729 2a00 0000  t32, String)*...
00000020: 0568 656c 6c6f                           .hello
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x15, 0x54, 0x75, 0x70, // Column type = "Tuple(UInt32, String)" (length=21)
0x6c, 0x65, 0x28, 0x55,
0x49, 0x6e, 0x74, 0x33,
0x32, 0x2c, 0x20, 0x53,
0x74, 0x72, 0x69, 0x6e,
0x67, 0x29,
// TupleElement0 stream (UInt32):
0x2a, 0x00, 0x00, 0x00, // Element 0: 42 (UInt32)
// TupleElement1 stream (String):
0x05, 0x68, 0x65, 0x6c, // Element 1: "hello" (String: length=5)
0x6c, 0x6f,
]);
Multiple Rows:
For multiple rows, each element stream contains all values for that element position across all rows:
SELECT arrayJoin([(1, 'a'), (2, 'b'), (3, 'c')])::Tuple(UInt8, String) AS col
This would serialize as:
- TupleElement0 stream: [1, 2, 3] (3 UInt8 values)
- TupleElement1 stream: ['a', 'b', 'c'] (3 String values)
Nested Tuples:
For nested tuples like Tuple(UInt32, Tuple(String, UInt8)), the streams are flattened hierarchically:
- {TupleElement0, Regular} - first element (UInt32)
- {TupleElement1, TupleElement0, Regular} - nested tuple's first element (String)
- {TupleElement1, TupleElement1, Regular} - nested tuple's second element (UInt8)
Named Tuples:
Named tuples (e.g., Tuple(x UInt32, y String)) serialize identically to unnamed tuples. The element names are only reflected in the type string and do not affect the binary wire format. The serialization order matches the definition order.
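Since each element stream is just an independent column, reassembling row tuples is a plain zip (a sketch; the name is ours):

```javascript
// Zip per-element tuple streams back into row tuples.
function tupleRows(elementStreams) {
  const numRows = elementStreams[0].length; // all streams have the same length
  const rows = [];
  for (let r = 0; r < numRows; r++) {
    rows.push(elementStreams.map((col) => col[r]));
  }
  return rows;
}
```

tupleRows([[1, 2, 3], ["a", "b", "c"]]) yields [[1, "a"], [2, "b"], [3, "c"]].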
Nested is a special column type that represents a structured collection of named fields, where each field is represented as an Array. Unlike Tuple, which stores data row-by-row, Nested stores data columnar-fashion with synchronized array lengths.
Conceptual Model:
- Nested(a T1, b T2, ...) is equivalent to multiple Array(T1), Array(T2), ... columns
- All arrays must have the same length for each row
- Fields are accessed using dot notation: column.field
Wire Format:
In the Native format, Nested columns are not serialized as a single unit. Instead, they are represented as separate Array columns with a special naming convention:
- Column Naming: Each nested field becomes a separate column named parent.field
- Type: Each column has type Array(FieldType)
- Serialization: Each array column is serialized independently using Array serialization (see section 6.3)
Example: Nested structure
# Note: Using Array(Tuple(...)) as Nested is mostly used internally in ClickHouse
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [(1, 'a'), (2, 'b')]::Array(Tuple(id UInt32, name String)) AS n" | xxd
00000000: 0101 016e 2441 7272 6179 2854 7570 6c65  ...n$Array(Tuple
00000010: 2869 6420 5549 6e74 3332 2c20 6e61 6d65  (id UInt32, name
00000020: 2053 7472 696e 6729 2902 0000 0000 0000   String)).......
00000030: 0001 0000 0002 0000 0001 6101 62         ..........a.b
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column 0:
0x01, 0x6e, // Column name = "n" (length=1)
0x24, // Type name length = 36
0x41, 0x72, 0x72, 0x61, 0x79, 0x28, // "Array(Tuple(id UInt32, name String))"
0x54, 0x75, 0x70, 0x6c, 0x65, 0x28,
0x69, 0x64, 0x20, 0x55, 0x49, 0x6e,
0x74, 0x33, 0x32, 0x2c, 0x20, 0x6e,
0x61, 0x6d, 0x65, 0x20, 0x53, 0x74,
0x72, 0x69, 0x6e, 0x67, 0x29, 0x29,
// Array stream: ArraySizes
0x02, 0x00, 0x00, 0x00, // Array size = 2 (UInt64 little-endian)
0x00, 0x00, 0x00, 0x00,
// Array stream: ArrayElements (Tuple serialization - sequential)
// Tuple element 0 (id UInt32):
0x01, 0x00, 0x00, 0x00, // Value 1
0x02, 0x00, 0x00, 0x00, // Value 2
// Tuple element 1 (name String):
0x01, 0x61, // "a" (length=1, data)
0x01, 0x62, // "b" (length=1, data)
]);
Implementation Note:
- True Nested columns (created with an explicit Nested type) are split into separate columns: n.id Array(UInt32) and n.name Array(String)
- Each appears as a separate column in the block with the dotted name
- The synchronization of array lengths is enforced at insertion time, not in the wire format
The Variant type represents a value that can be one of several different types, similar to a tagged union. It stores a discriminator indicating which type the value has, along with the actual value in a sparse columnar format.
Type Definition:
Variant(T1, T2, T3, ...)Stream Architecture:
Variant uses multiple logical streams:
1. VariantDiscriminators (prefix stream for the version byte + data stream for the discriminators)
   - Serialization mode byte (UInt64): 0 = BASIC, 1 = COMPACT
   - Discriminator values identifying which variant type each row contains
2. VariantElements (wrapper stream path component)
   - VariantElement (one substream per variant type)
   - Each variant's data stored sparsely (only values for rows with that discriminator)
Discriminator Values:
- 0xFF (255): NULL discriminator - the row contains NULL
- 0 to N-1: Global discriminator for variant types 0 through N-1
- Discriminators are always serialized in global order (the order types appear in the type definition)
Serialization Modes:
The discriminators stream has two modes, chosen based on the use_compact_variant_discriminators_serialization setting:
BASIC Mode (mode = 0):
- All discriminators written sequentially, one per row
- Each discriminator is a single UInt8 byte
- Simple, but may be inefficient when granules have uniform types
COMPACT Mode (mode = 1):
- Optimizes storage when granules have uniform discriminators
- Format per granule: <number of rows: VarUInt> <format byte: UInt8> <granule data>
- Format byte values:
  - 0 (PLAIN): Mixed discriminators, all written as in BASIC mode
  - 1 (COMPACT): Single discriminator for all rows, written once
Example: Variant with String and UInt64
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_variant_type=1" \
--data-binary "SELECT 'hello'::Variant(String, UInt64) AS col" | xxd
00000000: 0101 0363 6f6c 1756 6172 6961 6e74 2853  ...col.Variant(S
00000010: 7472 696e 672c 2055 496e 7436 3429 0000  tring, UInt64)..
00000020: 0000 0000 0000 0005 6865 6c6c 6f         ........hello
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column 0:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x17, // Type name length = 23
0x56, 0x61, 0x72, 0x69, 0x61, 0x6e, // "Variant(String, UInt64)"
0x74, 0x28, 0x53, 0x74, 0x72, 0x69,
0x6e, 0x67, 0x2c, 0x20, 0x55, 0x49,
0x6e, 0x74, 0x36, 0x34, 0x29,
// VariantDiscriminatorsPrefix stream:
0x00, 0x00, 0x00, 0x00, // Serialization mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
// VariantDiscriminators stream (BASIC mode):
0x00, // Discriminator = 0 (String type)
// VariantElements/VariantElement[0] (String):
0x05, 0x68, 0x65, 0x6c, 0x6c, 0x6f, // String: length=5, "hello"
// VariantElements/VariantElement[1] (UInt64):
// (empty - no values with discriminator 1)
]);
Example: Variant with multiple rows and NULL
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_variant_type=1" \
--data-binary "SELECT arrayJoin([42::Variant(String, UInt64), 'hello'::Variant(String, UInt64)]) AS col" | xxd
00000000: 0102 0363 6f6c 1756 6172 6961 6e74 2853  ...col.Variant(S
00000010: 7472 696e 672c 2055 496e 7436 3429 0000  tring, UInt64)..
00000020: 0000 0000 0000 0100 0568 656c 6c6f 2a00  .........hello*.
00000030: 0000 0000 0000                           ......
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x02, // NumRows = 2
// Column metadata...
0x03, 0x63, 0x6f, 0x6c, // "col"
0x17, /* "Variant(String, UInt64)" */,
// VariantDiscriminatorsPrefix:
0x00, 0x00, 0x00, 0x00, // Mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00,
// VariantDiscriminators (BASIC mode, 2 rows):
0x01, // Row 0: discriminator = 1 (UInt64)
0x00, // Row 1: discriminator = 0 (String)
// VariantElements/VariantElement[0] (String):
0x05, 0x68, 0x65, 0x6c, 0x6c, 0x6f, // 1 value: "hello"
// VariantElements/VariantElement[1] (UInt64):
0x2a, 0x00, 0x00, 0x00, // 1 value: 42
0x00, 0x00, 0x00, 0x00,
]);
Note: The data for each variant is stored sparsely - only the actual values are written, in the order they appear. The discriminators determine which variant column each row's value came from.
Example: Variant with NULL
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_variant_type=1" \
--data-binary "SELECT [NULL::Variant(String, UInt64), 'a'::Variant(String, UInt64), 1::Variant(String, UInt64)] AS col" | xxd
00000000: 0101 0363 6f6c 1e41 7272 6179 2856 6172  ...col.Array(Var
00000010: 6961 6e74 2853 7472 696e 672c 2055 496e  iant(String, UIn
00000020: 7436 3429 2900 0000 0000 0000 0003 0000  t64))...........
00000030: 0000 0000 00ff 0001 0561 0100 0000 0000  .........a......
00000040: 0000                                     ..
Key points in this example:
- Discriminator 0xff indicates NULL
- The String variant (discriminator 0x00) has 1 value: "a"
- The UInt64 variant (discriminator 0x01) has 1 value: 1
- Readers reconstruct the row-to-value mapping by counting discriminators per variant type
Sparse Variant Storage:
Each variant type's data column contains only the values for rows with that discriminator:
- If a Variant(String, UInt64) column has 100 rows where 30 are String, 60 are UInt64, and 10 are NULL:
  - The String variant column has 30 values
  - The UInt64 variant column has 60 values
  - The discriminators column has 100 discriminator bytes
Internally, ClickHouse maintains an offsets column to map each row to its position in the appropriate variant column. This offsets column is part of the in-memory representation only and is not serialized in the Native wire format. Readers must reconstruct this mapping by iterating through discriminators and counting values per variant type.
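The counting procedure described above can be sketched as follows (illustrative; the name is ours). Each variant keeps a cursor into its sparse value array, advanced as its discriminator is encountered:

```javascript
// Rebuild row values from BASIC-mode discriminators and the sparse
// per-variant value arrays. 0xFF marks NULL.
function reconstructVariant(discriminators, variantValues) {
  const cursors = variantValues.map(() => 0); // next unread value per variant
  return discriminators.map((d) =>
    d === 0xff ? null : variantValues[d][cursors[d]++]
  );
}
```

reconstructVariant([0xff, 0x00, 0x01], [["a"], [1]]) yields [null, "a", 1], matching the Array(Variant) example above.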
The Dynamic type can store values of different types, dynamically determining the type at insertion time. It's implemented internally using a Variant type with an additional "shared variant" for overflow types.
Type Definition:
Dynamic
Dynamic(max_types=N)
Where max_types (default 32) determines the maximum number of distinct types that can be stored as separate variants. Additional types are stored in a "shared variant" in binary-serialized form.
Stream Architecture:
Dynamic uses two main streams:
1. DynamicStructure stream:
   - Serialization version (UInt64)
   - Type list (structure metadata)
   - Optional statistics
2. DynamicData stream:
   - Contains the internal Variant column data (in standard mode)
   - OR contains indexes + flattened type columns (in FLATTENED mode)
Serialization Versions:
The DynamicStructure stream begins with a version byte (UInt64) determining the format:
V1 (version = 1):
- <max_dynamic_types> (VarUInt) - historical, now equals num_dynamic_types
- <num_dynamic_types> (VarUInt) - actual number of variant types
- <list of type names> (String for each type, excluding SharedVariant)
- Optional statistics (in MergeTree serialization)
V2 (version = 2):
- Same as V1 but without the max_dynamic_types parameter
- Type names as strings
V3 (version = 4):
- Like V2 but type names in binary format (via encodeDataType)
- Statistics prefixed with a bool flag indicating presence
Note: Version numbers are not sequential. V1=1, V2=2, V3=4. There is no version 3.
FLATTENED (version = 3):
- Used only in Native format for client compatibility
- Structure:
  - <num_types> (VarUInt)
  - <list of types> (each type in binary format or as string)
- Data format:
  - <type indexes> (UInt8/16/32/64 column, depending on num_types + 1 for NULL)
  - <data for type 0> <data for type 1> ...
Example: Dynamic with single value
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_dynamic_type=1" \
--data-binary "SELECT 42::Dynamic AS col" | xxd
00000000: 0101 0363 6f6c 0744 796e 616d 6963 0100  ...col.Dynamic..
00000010: 0000 0000 0000 0101 0653 7472 696e 6700 .........String.
00000020: 0000 0000 0000 0001 0234 32 .........42
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column 0:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x07, 0x44, 0x79, 0x6e, 0x61, // Column type = "Dynamic"
0x6d, 0x69, 0x63,
// DynamicStructure stream:
0x01, 0x00, 0x00, 0x00, // Version = 1 (V1)
0x00, 0x00, 0x00, 0x00,
  0x01, // max_dynamic_types = 1 (VarUInt)
  0x01, // num_dynamic_types = 1
  0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (length=6)
  0x6e, 0x67,
  // DynamicData stream (internal Variant column):
  // VariantDiscriminatorsPrefix:
  0x00, 0x00, 0x00, 0x00, // Mode = 0 (BASIC)
  0x00, 0x00, 0x00, 0x00,
  // VariantDiscriminators:
  // Variants sorted alphabetically: SharedVariant=0, String=1
  0x01, // Discriminator = 1 (String)
  // VariantElements/VariantElement[0] (SharedVariant):
  // (empty - no values with discriminator 0)
  // VariantElements/VariantElement[1] (String):
  0x02, 0x34, 0x32, // String: length=2, "42"
]);
Note: the literal 42 was inferred as the String value "42", so it is stored in the String variant.
Example: Dynamic with NULL
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_dynamic_type=1" \
--data-binary "SELECT NULL::Dynamic AS col" | xxd
00000000: 0101 0363 6f6c 0744 796e 616d 6963 0100  ...col.Dynamic..
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: ff .
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata...
0x03, 0x63, 0x6f, 0x6c, // "col"
0x07, 0x44, 0x79, 0x6e, 0x61, // "Dynamic"
0x6d, 0x69, 0x63,
// DynamicStructure stream:
0x01, 0x00, 0x00, 0x00, // Version = 1
0x00, 0x00, 0x00, 0x00,
  0x00, // max_dynamic_types = 0 (VarUInt)
  0x00, // num_dynamic_types = 0 (no types)
// DynamicData stream (Variant):
0x00, 0x00, 0x00, 0x00, // Mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00,
0xff, // Discriminator = 0xff (NULL)
]);
Example: Dynamic with FLATTENED serialization
curl -s -XPOST "http://localhost:8123?default_format=Native&output_format_native_use_flattened_dynamic_and_json_serialization=1&allow_experimental_dynamic_type=1" \
--data-binary "SELECT 42::Dynamic AS col" | xxd
00000000: 0101 0363 6f6c 0744 796e 616d 6963 0300  ...col.Dynamic..
00000010: 0000 0000 0000 0106 5374 7269 6e67 0002 ........String..
00000020: 3432 42
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata...
0x03, 0x63, 0x6f, 0x6c, // "col"
0x07, 0x44, 0x79, 0x6e, 0x61, // "Dynamic"
0x6d, 0x69, 0x63,
// DynamicStructure stream:
0x03, 0x00, 0x00, 0x00, // Version = 3 (FLATTENED)
0x00, 0x00, 0x00, 0x00,
0x01, // Number of types = 1 (VarUInt)
0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (length=6)
0x6e, 0x67,
// DynamicData stream (FLATTENED format):
// Type indexes column (UInt8, since num_types + 1 = 2 <= 256):
0x00, // Row 0: type index = 0 (String)
// Data for type 0 (String):
0x02, 0x34, 0x32, // String: length=2, "42"
]);
FLATTENED Mode Details:
In FLATTENED mode, the data is densely packed:
- Type Indexes: A column of UInt8/16/32/64 (chosen based on the number of types) where:
  - 0xFF...FF (all bits set) = NULL
  - 0 to N-1 = index into the types list
- Type Data Columns: One column per type, containing only the values of that type
  - Values appear in row order, but only for rows of that type
  - Readers must iterate through the indexes to know which type column to read the next value from
Standard Mode vs FLATTENED Mode:
- Standard Mode (V1/V2/V3): Uses an internal Variant column with sparse storage; more storage-efficient
- FLATTENED Mode: Easier for clients to parse; all data is densely packed, and the index array tells which type each row has
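The FLATTENED read loop can be sketched as follows. This is a minimal illustration with a hypothetical helper name (readFlattenedDynamic), and it assumes the per-type value arrays have already been decoded from their streams:

```javascript
// Recover row values from a FLATTENED Dynamic column: a type-index array
// plus one densely packed value array per type. For a UInt8 index column,
// 0xFF means NULL.
function readFlattenedDynamic(typeIndexes, typeColumns) {
  const cursors = typeColumns.map(() => 0); // next unread value per type column
  return typeIndexes.map((idx) => {
    if (idx === 0xff) return null;
    return typeColumns[idx][cursors[idx]++];
  });
}

// 4 rows over types ["String", "Int64"]: two Strings, one NULL, one Int64
const rows = readFlattenedDynamic(
  [0, 0xff, 1, 0],
  [["a", "b"], [42]]
);
// rows = ["a", null, 42, "b"]
```

Note how each type column is consumed in row order, which is exactly the "iterate through indexes" rule stated above.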
Shared Variant:
When Dynamic has more distinct types than max_types:
- The first max_types types get dedicated variant columns
- Additional types are stored in binary form in a "SharedVariant" type (always the last variant)
- Each value in SharedVariant is: encodeDataType(T) + serializeBinary(value, T)
- On deserialization, the type is decoded from the binary prefix and the value is deserialized
The JSON type stores semi-structured JSON data with automatic schema inference. It uses an object-oriented storage model where JSON paths are stored as separate columns, similar to how Nested columns work.
Type Definition:
JSON
JSON(max_dynamic_paths=N)
JSON(a Int64, b String) -- with typed paths
Storage Model:
JSON columns store data using three categories of paths:
- Typed Paths: Paths with known, fixed types (declared in type definition)
- Dynamic Paths: Frequently occurring paths discovered at runtime (up to max_dynamic_paths)
- Shared Data: Remaining paths stored in a shared Dynamic column
Stream Architecture:
Similar to Dynamic, JSON uses multiple streams:
- ObjectStructure stream:
  - Serialization version (UInt64)
  - Structure metadata (list of dynamic paths, statistics, etc.)
- ObjectData stream:
  - ObjectTypedPath substreams: One per typed path
  - ObjectDynamicPath substreams: One per dynamic path
  - ObjectSharedData substream: For overflow paths
Serialization Versions:
V1 (version = 0):
- <max_dynamic_paths> (VarUInt)
- <num_dynamic_paths> (VarUInt)
- <sorted list of dynamic path names> (String for each)
- Optional statistics
V2 (version = 2):
- Same as V1 but without max_dynamic_paths
V3 (version = 4):
- Like V2 with binary type encoding and optional statistics flag
Note: JSON Object version numbers are: V1=0, STRING=1, V2=2, FLATTENED=3, V3=4.
Important: Dynamic columns (used for dynamic paths) have different version numbers: V1=1, V2=2, FLATTENED=3, V3=4. Don't confuse JSON V1 (=0) with Dynamic V1 (=1).
STRING (version = 1):
- Special mode for Native format only
- JSON stored as plain String (JSON text)
- Controlled by output_format_native_write_json_as_string=1
FLATTENED (version = 3):
- Native format only, for easier client parsing
- All paths (typed + dynamic + shared) serialized as separate columns
Example: JSON with STRING serialization
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1&output_format_native_write_json_as_string=1" \
--data-binary "SELECT '{\"a\": 1}'::JSON AS col" | xxd
00000000: 0101 0363 6f6c 044a 534f 4e01 0000 0000  ...col.JSON.....
00000010: 0000 0007 7b22 6122 3a31 7d ....{"a":1}
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column 0:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x04, 0x4a, 0x53, 0x4f, 0x4e, // Column type = "JSON"
// ObjectStructure stream:
0x01, 0x00, 0x00, 0x00, // Version = 1 (STRING)
0x00, 0x00, 0x00, 0x00,
// ObjectData stream (String column):
0x07, // String length = 7
0x7b, 0x22, 0x61, 0x22, 0x3a, // '{"a":1}'
0x31, 0x7d,
]);
Example: JSON with standard serialization
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1" \
--data-binary "SELECT '{\"a\": 1, \"b\": \"hello\"}'::JSON AS col" | xxd
00000000: 0101 0363 6f6c 044a 534f 4e00 0000 0000  ...col.JSON.....
00000010: 0000 0002 0201 6101 6201 0000 0000 0000 ......a.b.......
00000020: 0001 0105 496e 7436 3400 0000 0000 0000 ....Int64.......
00000030: 0001 0000 0000 0000 0001 0106 5374 7269 ............Stri
00000040: 6e67 0000 0000 0000 0000 0001 0000 0000 ng..............
00000050: 0000 0001 0568 656c 6c6f 0000 0000 0000 .....hello......
00000060: 0000 ..
Breakdown (simplified structure):
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata...
0x03, 0x63, 0x6f, 0x6c, // "col"
0x04, 0x4a, 0x53, 0x4f, 0x4e, // "JSON"
// ObjectStructure stream:
0x00, 0x00, 0x00, 0x00, // Version = 0 (V1)
0x00, 0x00, 0x00, 0x00,
  0x02, // max_dynamic_paths = 2 (VarUInt)
  0x02, // num_dynamic_paths = 2
// Sorted dynamic paths:
0x01, 0x61, // Path 0: "a" (length=1)
0x01, 0x62, // Path 1: "b" (length=1)
// Shared data serialization version:
0x01, 0x00, 0x00, 0x00, // Version info
0x00, 0x00, 0x00, 0x00,
// ObjectData stream:
// Path "a" stored as Dynamic with inferred type Int64:
0x01, // num_types = 1
0x05, 0x49, 0x6e, 0x74, 0x36, 0x34, // "Int64"
// ... Variant discriminators ...
0x00, 0x00, 0x00, 0x00, // Mode
0x00, 0x00, 0x00, 0x00,
0x01, // Discriminator
// ... Value ...
0x01, 0x00, 0x00, 0x00, // 1 (as Int64)
0x00, 0x00, 0x00, 0x00,
// Path "b" stored as Dynamic with type String:
0x01, // num_types = 1
0x06, 0x53, 0x74, 0x72, 0x69, // "String"
0x6e, 0x67,
// ... Variant discriminators ...
0x00, 0x00, 0x00, 0x00, // Mode
0x00, 0x00, 0x00, 0x00,
0x01, // Discriminator
// ... Value ...
0x05, 0x68, 0x65, 0x6c, 0x6c, 0x6f, // "hello"
]);
Note: Each dynamic path is stored as a Dynamic column, so it includes the full Dynamic serialization (structure + variant data).
Example: JSON with typed and dynamic paths
This example demonstrates a JSON column with both a typed path (declared type) and a dynamic path (inferred type). Typed paths are serialized directly using their native format, while dynamic paths are wrapped in Dynamic serialization.
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1" \
--data-binary "SELECT '{\"a\": 42, \"b\": \"hi\"}'::JSON(a UInt32) AS col" | xxd
00000000: 0101 0363 6f6c 0e4a 534f 4e28 6120 5549  ...col.JSON(a UI
00000010: 6e74 3332 2900 0000 0000 0000 0001 0101 nt32)...........
00000020: 6201 0000 0000 0000 0001 0106 5374 7269 b...........Stri
00000030: 6e67 0000 0000 0000 0000 2a00 0000 0102 ng........*.....
00000040: 6869 0000 0000 0000 0000 hi........
Full breakdown:
const block = new Uint8Array([
// ═══════════════════════════════════════════════════════════════════
// BLOCK HEADER
// ═══════════════════════════════════════════════════════════════════
0x01, // NumColumns = 1 (VarUInt)
0x01, // NumRows = 1 (VarUInt)
// ═══════════════════════════════════════════════════════════════════
// COLUMN HEADER
// ═══════════════════════════════════════════════════════════════════
0x03, 0x63, 0x6f, 0x6c, // Column name = "col" (len=3)
0x0e, // Type name length = 14
0x4a, 0x53, 0x4f, 0x4e, 0x28, 0x61, // "JSON(a UInt32)"
0x20, 0x55, 0x49, 0x6e, 0x74, 0x33, // - typed path "a" with type UInt32
0x32, 0x29, // - typed paths are part of the type name
// ═══════════════════════════════════════════════════════════════════
// OBJECT STRUCTURE STREAM (prefix for JSON column)
// ═══════════════════════════════════════════════════════════════════
0x00, 0x00, 0x00, 0x00, // Version = 0 (V1 serialization)
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
// V1 structure metadata:
0x01, // max_dynamic_paths = 1 (VarUInt)
0x01, // num_dynamic_paths = 1 (VarUInt)
// Sorted list of dynamic path names:
0x01, 0x62, // Path 0: "b" (len=1, then "b")
// - typed path "a" is NOT listed here
// - only runtime-discovered paths appear
// NOTE: In Native format, statistics mode defaults to NONE, so no
// statistics are written here. The next bytes are the Dynamic prefix.
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC STRUCTURE STREAM (prefix for dynamic path "b")
// ═══════════════════════════════════════════════════════════════════
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
// NOTE: Dynamic V1=1, V2=2, FLATTENED=3, V3=4
// V1 Dynamic structure:
0x01, // max_dynamic_types = 1 (VarUInt)
0x01, // num_dynamic_types = 1 (VarUInt)
// Type names (not SharedVariant, which is implicit):
0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (len=6)
0x6e, 0x67,
// NOTE: Dynamic statistics also use NONE mode in Native format,
// so no statistics written here either.
// ═══════════════════════════════════════════════════════════════════
// VARIANT STRUCTURE STREAM (prefix for Dynamic's Variant)
// ═══════════════════════════════════════════════════════════════════
  0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
// ═══════════════════════════════════════════════════════════════════
// OBJECT DATA STREAM
// ═══════════════════════════════════════════════════════════════════
// --- TYPED PATH "a" (UInt32) ---
// Serialized directly using UInt32's native format (no Dynamic wrapper!)
0x2a, 0x00, 0x00, 0x00, // value = 42 (UInt32 little-endian)
// --- DYNAMIC PATH "b" (as Dynamic -> Variant) ---
// Variant discriminator column (1 row):
0x01, // Row 0: discriminator = 1 (String variant)
// Variants sorted alphabetically: SharedVariant=0, String=1
// (255 = NULL)
// String variant data:
0x02, 0x68, 0x69, // value = "hi" (len=2, then "hi")
// ═══════════════════════════════════════════════════════════════════
// SHARED DATA STREAM (overflow paths, empty in this case)
// ═══════════════════════════════════════════════════════════════════
// Shared data is Array(Tuple(path: String, value: String))
// For 1 row with 0 shared paths:
0x00, 0x00, 0x00, 0x00, // Array offsets: [0] (UInt64)
0x00, 0x00, 0x00, 0x00, // - row 0 has 0 elements
]);
Key observations:
- Typed path "a" (UInt32):
  - Declared in the type definition: JSON(a UInt32)
  - The type name includes the typed path specification
  - NOT listed in the dynamic paths list
  - Serialized directly as UInt32 (4 bytes, little-endian): 2a 00 00 00 = 42
  - No Dynamic wrapper overhead
- Dynamic path "b" (inferred as String):
  - Discovered at runtime from the JSON data
  - Listed in the dynamic paths list in ObjectStructure
  - Serialized as a full Dynamic column:
    - DynamicStructure: version, types list, statistics
    - VariantStructure: mode
    - VariantData: discriminators + type-specific data
- Serialization order in ObjectData:
  - Typed paths first (sorted alphabetically by path name)
  - Dynamic paths second (sorted alphabetically by path name)
  - Shared data last
- Space efficiency: Typed paths save significant space by avoiding the Dynamic overhead. For path "a", we use 4 bytes (UInt32) instead of ~20+ bytes (Dynamic structure + Variant + value).
Example: JSON with typed and dynamic paths (multiple rows)
This example shows the same structure with 2 rows, demonstrating how column data is laid out contiguously for each path.
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1" \
--data-binary "SELECT ('{\"a\": ' || toString(number * 10) || ', \"b\": \"row' || toString(number) || '\"}')::JSON(a UInt32) AS col FROM system.numbers LIMIT 2" | xxd
This produces rows: {"a": 0, "b": "row0"} and {"a": 10, "b": "row1"}.
00000000: 0102 0363 6f6c 0e4a 534f 4e28 6120 5549 ...col.JSON(a UI
00000010: 6e74 3332 2900 0000 0000 0000 0001 0101 nt32)...........
00000020: 6201 0000 0000 0000 0001 0106 5374 7269 b...........Stri
00000030: 6e67 0000 0000 0000 0000 0000 0000 0a00 ng..............
00000040: 0000 0101 0472 6f77 3004 726f 7731 0000 .....row0.row1..
00000050: 0000 0000 0000 0000 0000 0000 0000 ..............
Full breakdown:
const block = new Uint8Array([
// ═══════════════════════════════════════════════════════════════════
// BLOCK HEADER
// ═══════════════════════════════════════════════════════════════════
0x01, // NumColumns = 1 (VarUInt)
0x02, // NumRows = 2 (VarUInt) <-- 2 rows!
// ═══════════════════════════════════════════════════════════════════
// COLUMN HEADER
// ═══════════════════════════════════════════════════════════════════
0x03, 0x63, 0x6f, 0x6c, // Column name = "col" (len=3)
0x0e, // Type name length = 14
0x4a, 0x53, 0x4f, 0x4e, 0x28, 0x61, // "JSON(a UInt32)"
0x20, 0x55, 0x49, 0x6e, 0x74, 0x33,
0x32, 0x29,
// ═══════════════════════════════════════════════════════════════════
// OBJECT STRUCTURE STREAM
// ═══════════════════════════════════════════════════════════════════
0x00, 0x00, 0x00, 0x00, // Version = 0 (V1 serialization)
0x00, 0x00, 0x00, 0x00,
0x01, // max_dynamic_paths = 1 (VarUInt)
0x01, // num_dynamic_paths = 1 (VarUInt)
0x01, 0x62, // Path 0: "b" (len=1)
// NOTE: No statistics in Native format (NONE mode is default)
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC STRUCTURE STREAM (for path "b")
// ═══════════════════════════════════════════════════════════════════
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
0x00, 0x00, 0x00, 0x00, // (Dynamic V1=1, V2=2, V3=4)
0x01, // max_dynamic_types = 1 (VarUInt)
0x01, // num_dynamic_types = 1 (VarUInt)
0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (len=6)
0x6e, 0x67,
// NOTE: No Dynamic statistics in Native format (NONE mode)
// ═══════════════════════════════════════════════════════════════════
// VARIANT STRUCTURE STREAM
// ═══════════════════════════════════════════════════════════════════
  0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00,
// ═══════════════════════════════════════════════════════════════════
// OBJECT DATA STREAM
// ═══════════════════════════════════════════════════════════════════
// --- TYPED PATH "a" (UInt32) - ALL ROWS CONTIGUOUS ---
0x00, 0x00, 0x00, 0x00, // Row 0: a = 0
0x0a, 0x00, 0x00, 0x00, // Row 1: a = 10
// --- DYNAMIC PATH "b" - Variant discriminators (ALL ROWS) ---
// Variants sorted alphabetically: SharedVariant=0, String=1 (255=NULL)
0x01, // Row 0: discriminator = 1 (String)
0x01, // Row 1: discriminator = 1 (String)
// --- DYNAMIC PATH "b" - String variant data (ALL ROWS) ---
0x04, 0x72, 0x6f, 0x77, 0x30, // Row 0: "row0" (len=4)
0x04, 0x72, 0x6f, 0x77, 0x31, // Row 1: "row1" (len=4)
// ═══════════════════════════════════════════════════════════════════
// SHARED DATA STREAM (Array offsets for each row)
// ═══════════════════════════════════════════════════════════════════
0x00, 0x00, 0x00, 0x00, // Row 0 offset: 0 elements
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Row 1 offset: 0 elements
0x00, 0x00, 0x00, 0x00,
]);
Key observations for multi-row data:
- Columnar layout: All values for each path are stored contiguously:
  - Typed path "a": [0, 10] as two consecutive UInt32 values
  - Dynamic path "b" discriminators: [1, 1] (both rows are String)
  - Dynamic path "b" String data: ["row0", "row1"] stored sequentially
- No per-row overhead: The structure metadata (version, paths, types) is written once in the prefix, not repeated for each row. Only the actual data values scale with row count.
- Variant discriminators: For the Dynamic path, discriminators for ALL rows come first, then the actual variant data for all rows. This enables efficient columnar processing.
- Shared data offsets: One UInt64 offset per row indicating how many shared path entries that row has. With 0 shared paths, all offsets are 0.
Example: JSON with exceeded max_dynamic_paths (shared data)
This example shows what happens when the number of paths exceeds max_dynamic_paths. Overflow paths are stored in the shared data section.
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1" \
--data-binary "SELECT '{\"a\": 1, \"b\": 2, \"c\": 3}'::JSON(max_dynamic_paths=2) AS col" | xxd
With max_dynamic_paths=2 and 3 paths in the JSON:
- Paths "a" and "b" become dynamic paths (stored as Dynamic columns)
- Path "c" overflows to shared data
00000000: 0101 0363 6f6c 194a 534f 4e28 6d61 785f ...col.JSON(max_
00000010: 6479 6e61 6d69 635f 7061 7468 733d 3229 dynamic_paths=2)
00000020: 0000 0000 0000 0000 0202 0161 0162 0100 ...........a.b..
00000030: 0000 0000 0000 0101 0549 6e74 3634 0000 .........Int64..
00000040: 0000 0000 0000 0100 0000 0000 0000 0101 ................
00000050: 0549 6e74 3634 0000 0000 0000 0000 0001 .Int64..........
00000060: 0000 0000 0000 0000 0200 0000 0000 0000 ................
00000070: 0100 0000 0000 0000 0163 090a 0300 0000 .........c......
00000080: 0000 0000 ....
Full breakdown:
const block = new Uint8Array([
// ═══════════════════════════════════════════════════════════════════
// BLOCK HEADER
// ═══════════════════════════════════════════════════════════════════
0x01, // NumColumns = 1
0x01, // NumRows = 1
// ═══════════════════════════════════════════════════════════════════
// COLUMN HEADER
// ═══════════════════════════════════════════════════════════════════
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x19, // Type name length = 25
// "JSON(max_dynamic_paths=2)"
0x4a, 0x53, 0x4f, 0x4e, 0x28, 0x6d, 0x61, 0x78, 0x5f,
0x64, 0x79, 0x6e, 0x61, 0x6d, 0x69, 0x63, 0x5f,
0x70, 0x61, 0x74, 0x68, 0x73, 0x3d, 0x32, 0x29,
// ═══════════════════════════════════════════════════════════════════
// OBJECT STRUCTURE STREAM
// ═══════════════════════════════════════════════════════════════════
0x00, 0x00, 0x00, 0x00, // Version = 0 (V1)
0x00, 0x00, 0x00, 0x00,
0x02, // max_dynamic_paths = 2
0x02, // num_dynamic_paths = 2
0x01, 0x61, // Path 0: "a"
0x01, 0x62, // Path 1: "b"
// Note: "c" is NOT here - it's in shared data
// NOTE: No statistics in Native format (NONE mode is default)
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC PATH "a" STRUCTURE
// ═══════════════════════════════════════════════════════════════════
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
0x00, 0x00, 0x00, 0x00, // (Dynamic V1=1, V2=2, V3=4)
0x01, // max_dynamic_types = 1
0x01, // num_dynamic_types = 1
0x05, 0x49, 0x6e, 0x74, 0x36, 0x34, // Type: "Int64"
// NOTE: No Dynamic stats (NONE mode)
  0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00,
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC PATH "b" STRUCTURE (similar to "a")
// ═══════════════════════════════════════════════════════════════════
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
0x00, 0x00, 0x00, 0x00,
0x01, // max_dynamic_types = 1
0x01, // num_dynamic_types = 1
0x05, 0x49, 0x6e, 0x74, 0x36, 0x34, // Type: "Int64"
  0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00,
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC PATH "a" DATA (Int64 value = 1)
// ═══════════════════════════════════════════════════════════════════
// Variants sorted alphabetically: Int64=0, SharedVariant=1 (255=NULL)
0x00, // Discriminator = 0 (Int64 variant)
0x01, 0x00, 0x00, 0x00, // value = 1 (Int64 little-endian)
0x00, 0x00, 0x00, 0x00,
// ═══════════════════════════════════════════════════════════════════
// DYNAMIC PATH "b" DATA (Int64 value = 2)
// ═══════════════════════════════════════════════════════════════════
0x00, // Discriminator = 0 (Int64 variant)
0x02, 0x00, 0x00, 0x00, // value = 2 (Int64 little-endian)
0x00, 0x00, 0x00, 0x00,
// ═══════════════════════════════════════════════════════════════════
// SHARED DATA STREAM
// This is where path "c" lives (overflow from max_dynamic_paths)
// Format: Array(Tuple(path: String, value: String))
// ═══════════════════════════════════════════════════════════════════
// Array offsets (one UInt64 per row):
0x01, 0x00, 0x00, 0x00, // Row 0: offset = 1 (has 1 element)
0x00, 0x00, 0x00, 0x00,
// Shared data element 0 (path "c"):
0x01, 0x63, // Path name: "c" (len=1)
// Binary-encoded Dynamic value for "c":
0x09, // Value string length = 9 bytes
0x0a, // BinaryTypeIndex = 0x0a (Int64)
0x03, 0x00, 0x00, 0x00, // value = 3 (Int64 little-endian)
0x00, 0x00, 0x00, 0x00,
]);
Key observations for shared data:
- Overflow mechanism: When paths exceed max_dynamic_paths, extra paths go to shared data. Here "a" and "b" are dynamic paths, "c" overflows.
- Shared data format: Array(Tuple(path: String, value: String))
  - Array offsets indicate how many shared paths each row has
  - Each element is a (path_name, binary_encoded_value) tuple
- Binary-encoded values in shared data: The value is stored as a length-prefixed string containing:
  - BinaryTypeIndex (1 byte): Type identifier (0x0a = Int64)
  - Native value: The value in its type's binary format
- BinaryTypeIndex values (common types):

| Type | Index |
|---|---|
| Nothing | 0x00 |
| UInt8 | 0x01 |
| UInt32 | 0x03 |
| UInt64 | 0x04 |
| Int64 | 0x0a |
| String | 0x15 |
| Array | 0x1e |
| JSON | 0x30 |

- Space tradeoff: Shared data is less efficient than dynamic paths because each value includes its path name and type encoding. Choose max_dynamic_paths wisely for your data.
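Decoding one shared-data value can be sketched as follows. This is a minimal illustration (the helper name decodeSharedValue is hypothetical); it handles only the Int64 case shown in the example above and assumes the length fits in a single VarUInt byte:

```javascript
// Decode one shared-data value: a length-prefixed string whose body is
// BinaryTypeIndex (1 byte) followed by the value in its native format.
function decodeSharedValue(bytes) {
  const len = bytes[0];                 // single-byte VarUInt length (assumes < 128)
  const body = bytes.slice(1, 1 + len);
  if (body[0] === 0x0a) {               // BinaryTypeIndex 0x0a = Int64
    const view = new DataView(Uint8Array.from(body.slice(1)).buffer);
    return { type: "Int64", value: view.getBigInt64(0, true) }; // little-endian
  }
  throw new Error("unhandled BinaryTypeIndex 0x" + body[0].toString(16));
}

// The bytes for path "c" from the example: length=9, Int64 (0x0a), value=3
const decoded = decodeSharedValue([0x09, 0x0a, 0x03, 0, 0, 0, 0, 0, 0, 0]);
// decoded = { type: "Int64", value: 3n }
```

A full reader would dispatch on the whole BinaryTypeIndex table instead of the single Int64 branch shown here.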
Example: Variant discriminator ordering (multiple types)
Variant types are sorted alphabetically by type name to determine discriminator values. SharedVariant is included in this sorting. Here's an example showing how discriminators are assigned:
-- Insert multiple rows with different JSON value types
INSERT INTO test_json VALUES
('{"x": true}'), -- Bool
('{"x": 3.14}'), -- Float64
('{"x": "hello"}'), -- String
('{"x": [1,2,3]}'), -- Array(Nullable(Int64))
('{"x": null}')      -- NULL
The Dynamic column for path "x" will have variant types sorted alphabetically:
| Index | Type | Notes |
|---|---|---|
| 0 | Array(Nullable(Int64)) | "A" < "B" < ... |
| 1 | Bool | |
| 2 | Float64 | |
| 3 | SharedVariant | Implicit, always present |
| 4 | String | "Sh..." < "St..." |
| 255 | (NULL) | Special NULL_DISCRIMINATOR |
The discriminator bytes in the serialized data will be:
0x01, // Bool (index 1)
0x02, // Float64 (index 2)
0x04, // String (index 4, after SharedVariant=3)
0x00, // Array (index 0)
0xff, // NULL (255)
Key rule: For any type T in a Dynamic column, its discriminator = index in the alphabetically sorted list of [all_types + SharedVariant].
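This rule can be sketched directly. A minimal illustration (the function name variantDiscriminators is hypothetical), valid for ASCII type names where JavaScript's default lexicographic sort matches byte-wise ordering:

```javascript
// discriminator(T) = index of T in the alphabetically sorted list of
// [all dynamic types + "SharedVariant"].
function variantDiscriminators(typeNames) {
  const sorted = [...typeNames, "SharedVariant"].sort(); // lexicographic sort
  return new Map(sorted.map((name, idx) => [name, idx]));
}

const d = variantDiscriminators([
  "Bool", "Float64", "String", "Array(Nullable(Int64))",
]);
// Array(Nullable(Int64))=0, Bool=1, Float64=2, SharedVariant=3, String=4
```

Note that "SharedVariant" sorts before "String" ('h' < 't' at the second character), which is why String gets discriminator 4 in the table above.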
Path Naming:
JSON paths use dot notation:
- a - top-level key "a"
- a.b - nested key "b" inside "a"
- a.b.c - deeply nested
ClickHouse can escape dots in JSON keys when json_type_escape_dots_in_keys=1 is set.
Typed Paths:
When a JSON type has typed paths declared:
JSON(user_id UInt64, name String)
These paths:
- Always have the specified type (no inference needed)
- Are serialized directly in their native type format (not as Dynamic)
- Appear in ObjectData/ObjectTypedPath substreams
Dynamic Paths:
Dynamic paths are:
- Discovered at insertion time
- Limited by max_dynamic_paths (default 1024)
- Stored as Dynamic columns with inferred types
- Appear in ObjectData/ObjectDynamicPath substreams
Shared Data:
When more than max_dynamic_paths unique paths exist:
- Overflow paths are stored in a shared Dynamic column
- Each value in shared data includes its full path + value
- Uses binary serialization similar to Dynamic's shared variant
Format Setting: output_format_native_write_json_as_string:
When set to 1:
- Uses STRING serialization (version = 1)
- Entire JSON object serialized as text String
- Easier for clients that don't support complex JSON deserialization
- Default: 0 (use structured serialization)
Format Setting: output_format_native_use_flattened_dynamic_and_json_serialization:
When set to 1:
- Uses FLATTENED serialization (version = 3)
- All paths serialized as individual columns
- No Dynamic wrapper, just direct column data
- Easier for some clients to parse
- Default: 0
Reading JSON Columns:
To deserialize a JSON column in standard mode:
- Read ObjectStructure to get version and path list
- For each typed path: deserialize using its known type
- For each dynamic path: deserialize as a Dynamic column
- If shared data exists: deserialize the shared Dynamic column
- Reconstruct JSON objects by merging values from all paths
In STRING mode: just read the String value containing JSON text.
In FLATTENED mode: read each path as a separate column, with a NULL indicator for missing paths.
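STRING mode is the simplest to implement. A minimal reader sketch (the helper name readJsonStringMode is hypothetical), assuming single-byte VarUInt lengths and single-byte characters in the JSON text:

```javascript
// Read a JSON column written in STRING mode (version = 1): a UInt64 version
// prefix, then one length-prefixed JSON text string per row.
function readJsonStringMode(bytes, numRows) {
  const view = new DataView(Uint8Array.from(bytes).buffer);
  const version = view.getUint32(0, true);  // low half of the UInt64 version
  if (version !== 1) throw new Error("not STRING-mode JSON");
  let pos = 8;
  const rows = [];
  for (let i = 0; i < numRows; i++) {
    const len = bytes[pos++];               // single-byte VarUInt (assumes < 128)
    const text = String.fromCharCode(...bytes.slice(pos, pos + len));
    pos += len;
    rows.push(JSON.parse(text));
  }
  return rows;
}

// ObjectStructure version = 1, then one row: '{"a":1}' (7 bytes)
const jsonRows = readJsonStringMode(
  [1, 0, 0, 0, 0, 0, 0, 0, 7, 0x7b, 0x22, 0x61, 0x22, 0x3a, 0x31, 0x7d],
  1
);
// jsonRows = [{ a: 1 }]
```

This mirrors the STRING-mode example earlier in the section; a production reader would decode a full VarUInt and UTF-8 text.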
Geographic data types in ClickHouse are type aliases that compose existing types (Tuple, Array) to represent geometric shapes. While their underlying representation uses standard composite types, these geo types have distinct names in the Native format's type string.
Important: The type NAME in the Native format wire protocol is the geo type alias (Point, Ring, etc.), not the underlying composite type (Tuple(Float64, Float64), Array(Point), etc.). This allows parsers to recognize geographic types explicitly.
Point represents a single 2D coordinate as a tuple of two Float64 values (X, Y).
Type Alias:
Point = Tuple(Float64, Float64)
Wire Format:
The Point type serializes as a Tuple with two Float64 elements:
- Type string: "Point" (not "Tuple(Float64, Float64)")
- Stream {TupleElement0}: X coordinate (Float64, 8 bytes little-endian)
- Stream {TupleElement1}: Y coordinate (Float64, 8 bytes little-endian)
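Decoding these two streams is a pair of little-endian Float64 reads. A minimal sketch (readPoints is a hypothetical helper) using the bytes from the single-point example in this section:

```javascript
// Decode a Point column from its two Float64 streams (X then Y),
// each a sequence of 8-byte little-endian IEEE 754 values.
function readPoints(xBytes, yBytes, numRows) {
  const xs = new DataView(Uint8Array.from(xBytes).buffer);
  const ys = new DataView(Uint8Array.from(yBytes).buffer);
  const points = [];
  for (let i = 0; i < numRows; i++) {
    points.push([xs.getFloat64(i * 8, true), ys.getFloat64(i * 8, true)]);
  }
  return points;
}

const pts = readPoints(
  [0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf8, 0x3f], // 0x3ff8000000000000 = 1.5
  [0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x40], // 0x4004000000000000 = 2.5
  1
);
// pts = [[1.5, 2.5]]
```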
Example: Single Point
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT (1.5, 2.5)::Point AS col" | xxd
00000000: 0101 0363 6f6c 0550 6f69 6e74 0000 0000  ...col.Point....
00000010: 0000 f83f 0000 0000 0000 0440 ...?.......@
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col" (String: length=3)
0x05, 0x50, 0x6f, 0x69, // Column type = "Point" (String: length=5)
0x6e, 0x74, // (continuation)
// TupleElement0 stream (X coordinate):
0x00, 0x00, 0x00, 0x00, // X = 1.5 (Float64, little-endian)
0x00, 0x00, 0xf8, 0x3f, // IEEE 754: 0x3ff8000000000000
// TupleElement1 stream (Y coordinate):
0x00, 0x00, 0x00, 0x00, // Y = 2.5 (Float64, little-endian)
0x00, 0x00, 0x04, 0x40, // IEEE 754: 0x4004000000000000
]);
Ring represents a closed loop of points, typically used to define polygon boundaries. It is an array of Point values.
Type Alias:
Ring = Array(Point)
Wire Format:
The Ring type serializes as an Array(Point):
- Type string: "Ring" (not "Array(Point)")
- Stream {ArraySizes}: Cumulative offsets (one UInt64 per row)
- Stream {ArrayElements, TupleElement0}: X coordinates of all points (flattened Float64 values)
- Stream {ArrayElements, TupleElement1}: Y coordinates of all points (flattened Float64 values)
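Rebuilding rings from the cumulative offsets and the flattened coordinate arrays can be sketched as follows (rebuildRings is a hypothetical helper; it assumes the offsets and Float64 values have already been decoded into plain arrays):

```javascript
// Reconstruct Ring rows from cumulative ArraySizes offsets plus the
// flattened X and Y coordinate arrays.
function rebuildRings(offsets, xs, ys) {
  const rings = [];
  let start = 0;
  for (const end of offsets) {           // each offset is a cumulative point count
    const ring = [];
    for (let i = start; i < end; i++) ring.push([xs[i], ys[i]]);
    rings.push(ring);
    start = end;
  }
  return rings;
}

// Two rows: a ring of 3 points, then a ring of 1 point
const rings = rebuildRings([3, 4], [0, 1, 1, 5], [0, 0, 1, 5]);
// rings = [[[0, 0], [1, 0], [1, 1]], [[5, 5]]]
```

The same slicing pattern applies one level deeper for Polygon, where an outer offsets array selects ranges of rings.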
Example: Ring with 4 points
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]::Ring AS col" | xxd
00000000: 0101 0363 6f6c 0452 696e 6704 0000 0000  ...col.Ring.....
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 00f0 3f00 0000 0000 00f0 3f00 0000 0000 ..?.......?.....
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 00f0 3f00 0000 0000 ..........?.....
00000050: 0000 00 ...
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x04, 0x52, 0x69, 0x6e, // Column type = "Ring" (length=4)
0x67,
// ArraySizes stream:
0x04, 0x00, 0x00, 0x00, // Array size = 4 points (UInt64)
0x00, 0x00, 0x00, 0x00,
// ArrayElements/TupleElement0 stream (X coordinates, 4 Float64 values):
0x00, 0x00, 0x00, 0x00, // Point[0].X = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[1].X = 1.0
0x00, 0x00, 0xf0, 0x3f,
0x00, 0x00, 0x00, 0x00, // Point[2].X = 1.0
0x00, 0x00, 0xf0, 0x3f,
0x00, 0x00, 0x00, 0x00, // Point[3].X = 0.0
0x00, 0x00, 0x00, 0x00,
// ArrayElements/TupleElement1 stream (Y coordinates, 4 Float64 values):
0x00, 0x00, 0x00, 0x00, // Point[0].Y = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[1].Y = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[2].Y = 1.0
0x00, 0x00, 0xf0, 0x3f,
0x00, 0x00, 0x00, 0x00, // Point[3].Y = 0.0
0x00, 0x00, 0x00, 0x00,
]);
Note: A valid ring should have its first and last points identical (closed loop), though this is not enforced at the wire format level.
Polygon represents a polygon consisting of an outer boundary ring and optional holes (inner rings). It is an array of Ring values.
Type Alias:
Polygon = Array(Ring)
Wire Format:
The Polygon type serializes as Array(Ring), which nests as Array(Array(Point)):
- Type string: "Polygon" (not "Array(Ring)")
- Stream {ArraySizes}: Offsets for the outer array (number of rings)
- Stream {ArrayElements, ArraySizes}: Offsets for the inner arrays (points per ring)
- Stream {ArrayElements, ArrayElements, TupleElement0}: All X coordinates
- Stream {ArrayElements, ArrayElements, TupleElement1}: All Y coordinates
Example: Simple Polygon with one ring
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]]::Polygon AS col" | xxd
00000000: 0101 0363 6f6c 0750 6f6c 7967 6f6e 0100 ...col.Polygon..
00000010: 0000 0000 0000 0500 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 2440 0000 ............$@..
00000030: 0000 0000 2440 0000 0000 0000 0000 0000 ....$@..........
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 2440 0000 ............$@..
00000060: 0000 0000 2440 0000 0000 0000 0000 ....$@........
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x07, 0x50, 0x6f, 0x6c, // Column type = "Polygon" (length=7)
0x79, 0x67, 0x6f, 0x6e,
// Outer ArraySizes (number of rings):
0x01, 0x00, 0x00, 0x00, // 1 ring (UInt64)
0x00, 0x00, 0x00, 0x00,
// Inner ArraySizes (points per ring):
0x05, 0x00, 0x00, 0x00, // Ring[0] has 5 points (UInt64)
0x00, 0x00, 0x00, 0x00,
// All X coordinates (5 Float64 values):
0x00, 0x00, 0x00, 0x00, // Point[0].X = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[1].X = 10.0
0x00, 0x00, 0x24, 0x40, // (0x4024000000000000)
0x00, 0x00, 0x00, 0x00, // Point[2].X = 10.0
0x00, 0x00, 0x24, 0x40,
0x00, 0x00, 0x00, 0x00, // Point[3].X = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[4].X = 0.0
0x00, 0x00, 0x00, 0x00,
// All Y coordinates (5 Float64 values):
0x00, 0x00, 0x00, 0x00, // Point[0].Y = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[1].Y = 0.0
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, // Point[2].Y = 10.0
0x00, 0x00, 0x24, 0x40,
0x00, 0x00, 0x00, 0x00, // Point[3].Y = 10.0
0x00, 0x00, 0x24, 0x40,
0x00, 0x00, 0x00, 0x00, // Point[4].Y = 0.0
0x00, 0x00, 0x00, 0x00,
]);
Convention: The first ring in a polygon is the outer boundary, and subsequent rings (if any) represent holes.
MultiPolygon represents a collection of polygons. It is an array of Polygon values.
Type Alias:
MultiPolygon = Array(Polygon)
Wire Format:
The MultiPolygon type serializes as Array(Polygon), nesting three levels deep as Array(Array(Array(Point))):
- Type string: `"MultiPolygon"` (not `"Array(Polygon)"`)
- Stream `{ArraySizes}`: Offsets for polygons
- Stream `{ArrayElements, ArraySizes}`: Offsets for rings within each polygon
- Stream `{ArrayElements, ArrayElements, ArraySizes}`: Offsets for points within each ring
- Stream `{ArrayElements, ArrayElements, ArrayElements, TupleElement0}`: All X coordinates
- Stream `{ArrayElements, ArrayElements, ArrayElements, TupleElement1}`: All Y coordinates
Example: MultiPolygon with two simple triangular polygons
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [[[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 0.0)]], [[(20.0, 20.0), (30.0, 20.0), (30.0, 30.0), (20.0, 20.0)]]]::MultiPolygon AS col" | xxd
00000000: 0101 0363 6f6c 0c4d 756c 7469 506f 6c79 ...col.MultiPoly
00000010: 676f 6e02 0000 0000 0000 0001 0000 0000 gon.............
00000020: 0000 0002 0000 0000 0000 0004 0000 0000 ................
00000030: 0000 0008 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0024 4000 0000 0000 .........$@.....
00000050: 0024 4000 0000 0000 0000 0000 0000 0000 .$@.............
00000060: 0034 4000 0000 0000 003e 4000 0000 0000 .4@......>@.....
00000070: 003e 4000 0000 0000 0034 4000 0000 0000 .>@......4@.....
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000090: 0024 4000 0000 0000 0000 0000 0000 0000 .$@.............
000000a0: 0034 4000 0000 0000 0034 4000 0000 0000 .4@......4@.....
000000b0: 003e 4000 0000 0000 0034 40 .>@......4@
Breakdown omitted for brevity (follows nested array pattern with 3 levels of offsets).
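In place of a byte-level breakdown, the nesting can be illustrated with a small sketch. The helper below is hypothetical (not part of ClickHouse); it flattens one MultiPolygon row into the cumulative offsets per nesting level plus flat X/Y streams described above. On the wire each offset is a little-endian UInt64; here they are kept as plain numbers.

```javascript
// Hypothetical helper: flatten a MultiPolygon (Array(Array(Array(Point))))
// into per-level cumulative offsets and flat coordinate streams.
function flattenMultiPolygon(mp) {
  const ringOffsets = [], pointOffsets = [], xs = [], ys = [];
  let rings = 0, points = 0;
  for (const polygon of mp) {
    rings += polygon.length;
    ringOffsets.push(rings);          // {ArrayElements, ArraySizes}
    for (const ring of polygon) {
      points += ring.length;
      pointOffsets.push(points);      // {ArrayElements, ArrayElements, ArraySizes}
      for (const [x, y] of ring) { xs.push(x); ys.push(y); }
    }
  }
  // polyOffsets has a single entry because this sketch handles one row
  return { polyOffsets: [mp.length], ringOffsets, pointOffsets, xs, ys };
}

// The two-triangle example above yields offsets 2 / 1,2 / 4,8,
// matching the 02, 01, 02, 04, 08 bytes visible in the hex dump.
const mp = [
  [[[0, 0], [10, 0], [10, 10], [0, 0]]],
  [[[20, 20], [30, 20], [30, 30], [20, 20]]],
];
console.log(flattenMultiPolygon(mp));
```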
LineString represents a sequence of connected line segments as an array of points.
Type Alias:
LineString = Array(Point)
Wire Format:
The LineString type serializes identically to Ring (both are Array(Point)):
- Type string: `"LineString"` (not `"Array(Point)"`)
- Stream `{ArraySizes}`: Cumulative offsets
- Stream `{ArrayElements, TupleElement0}`: X coordinates
- Stream `{ArrayElements, TupleElement1}`: Y coordinates
Example: LineString with 3 points
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]::LineString AS col" | xxd
00000000: 0101 0363 6f6c 0a4c 696e 6553 7472 696e ...col.LineStrin
00000010: 6703 0000 0000 0000 0000 0000 0000 00f0 g...............
00000020: 3f00 0000 0000 0008 4000 0000 0000 0014 ?.......@.......
00000030: 4000 0000 0000 0000 4000 0000 0000 0010 @.......@.......
00000040: 4000 0000 0000 0018 40 @.......@
Breakdown omitted (follows same pattern as Ring).
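The omitted breakdown can be reproduced with a small encoder sketch. This helper is hypothetical and only handles the simple case shown above (a single row, name and type lengths under 128 so each VarUInt fits in one byte); it emits the same bytes as the hex dump.

```javascript
// Hypothetical encoder for a single-row `col LineString` Native block:
// header, one UInt64 ArraySizes offset, then all X values, then all Y.
function encodeLineStringBlock(name, points) {
  const header = Buffer.from([
    0x01, 0x01,                                        // NumColumns, NumRows
    name.length, ...Buffer.from(name),                 // column name
    'LineString'.length, ...Buffer.from('LineString'), // type string
  ]);
  const body = Buffer.alloc(8 + points.length * 16);
  body.writeBigUInt64LE(BigInt(points.length), 0);     // ArraySizes offset
  points.forEach(([x], i) => body.writeDoubleLE(x, 8 + i * 8));                      // all X first
  points.forEach(([, y], i) => body.writeDoubleLE(y, 8 + (points.length + i) * 8));  // then all Y
  return Buffer.concat([header, body]);
}

const block = encodeLineStringBlock('col', [[1, 2], [3, 4], [5, 6]]);
console.log(block.toString('hex')); // matches the dump above
```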
Note: Unlike Ring, a LineString is not required to form a closed loop (first and last points need not match).
MultiLineString represents a collection of line strings. It is an array of LineString values.
Type Alias:
MultiLineString = Array(LineString)
Wire Format:
The MultiLineString type serializes as Array(LineString), nesting as Array(Array(Point)):
- Type string: `"MultiLineString"` (not `"Array(LineString)"`)
- Stream `{ArraySizes}`: Offsets for line strings
- Stream `{ArrayElements, ArraySizes}`: Offsets for points within each line string
- Stream `{ArrayElements, ArrayElements, TupleElement0}`: All X coordinates
- Stream `{ArrayElements, ArrayElements, TupleElement1}`: All Y coordinates
Example: MultiLineString with two line strings
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0), (7.0, 8.0)]]::MultiLineString AS col" | xxd
00000000: 0101 0363 6f6c 0f4d 756c 7469 4c69 6e65 ...col.MultiLine
00000010: 5374 7269 6e67 0200 0000 0000 0000 0200 String..........
00000020: 0000 0000 0000 0400 0000 0000 0000 0000 ................
00000030: 0000 0000 f03f 0000 0000 0000 0840 0000 .....?.......@..
00000040: 0000 0000 1440 0000 0000 0000 1c40 0000 .....@.......@..
00000050: 0000 0000 0040 0000 0000 0000 1040 0000 .....@.......@..
00000060: 0000 0000 1840 0000 0000 0000 2040 .....@...... @
Breakdown omitted (follows nested array pattern).
Geometry is a variant type that can hold any of the geographic types. It uses the Variant type structure with specific discriminant ordering for geo types.
Type Definition:
Geometry = Variant(LineString, MultiLineString, MultiPolygon, Point, Polygon, Ring)
Discriminant Mapping:
| Discriminant Index | Geo Type |
|---|---|
| 0 | LineString |
| 1 | MultiLineString |
| 2 | MultiPolygon |
| 3 | Point |
| 4 | Polygon |
| 5 | Ring |
Wire Format:
The Geometry type follows standard Variant serialization (see section 6.7):
- Type string: `"Geometry"` (not `"Variant(LineString, ...)"`)
- Stream `{VariantDiscriminatorsPrefix}`: Serialization mode (UInt64)
- Stream `{VariantDiscriminators}`: Discriminator indices (mode-dependent)
- Streams `{VariantElements, VariantElement[i], ...}`: Sparse data for each variant type
Important Configuration:
The Geometry type contains structurally similar variants (LineString and Ring are both Array(Point)), which ClickHouse considers "suspicious" for potential ambiguity. To use Geometry, you must enable the setting:
SET allow_suspicious_variant_types = 1;
Example: Geometry containing a Point
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_suspicious_variant_types=1" \
--data-binary "SELECT (1.0, 2.0)::Point::Geometry AS col" | xxd
00000000: 0101 0363 6f6c 0847 656f 6d65 7472 7900 ...col.Geometry.
00000010: 0000 0000 0000 0003 0000 0000 0000 f03f ...............?
00000020: 0000 0000 0000 0040 .......@
Breakdown:
const block = new Uint8Array([
0x01, // NumColumns = 1
0x01, // NumRows = 1
// Column metadata:
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
0x08, 0x47, 0x65, 0x6f, // Column type = "Geometry" (length=8)
0x6d, 0x65, 0x74, 0x72,
0x79,
// VariantDiscriminatorsPrefix stream:
0x00, 0x00, 0x00, 0x00, // Serialization mode = 0 (BASIC)
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
// VariantDiscriminators stream (BASIC mode):
0x03, // Discriminator = 3 (Point type)
// VariantElements/VariantElement[3] (Point data):
// TupleElement0 (X coordinate):
0x00, 0x00, 0x00, 0x00, // X = 1.0 (Float64)
0x00, 0x00, 0xf0, 0x3f,
// TupleElement1 (Y coordinate):
0x00, 0x00, 0x00, 0x00, // Y = 2.0 (Float64)
0x00, 0x00, 0x00, 0x40,
]);
Key Observations:
- Type Name vs. Underlying Type: The wire format uses the geo type alias name (e.g., `"Point"`, `"Geometry"`) in the type string, not the underlying composite type name. This allows parsers to distinguish geographic types from generic tuples and arrays.
- Structural Composition: Geo types compose existing Native format features:
  - `Point` uses `Tuple` streams
  - `Ring` and `LineString` use `Array` streams
  - `Polygon`, `MultiPolygon`, and `MultiLineString` use nested `Array` streams
  - `Geometry` uses `Variant` streams
- Sparse Variant Storage: In `Geometry`, only the variant element corresponding to the discriminator contains data. Other variant elements are empty.
- Ambiguity Warning: `LineString` and `Ring` have identical underlying types (`Array(Point)`), which creates potential ambiguity in `Geometry`. This is why `allow_suspicious_variant_types = 1` is required.
QBit is a specialized vector data type designed for efficient similarity search operations. It stores vectors of floating-point numbers in a bit-transposed format for optimized SIMD operations during distance calculations.
QBit(element_type, dimension)
Where:
- `element_type`: The numeric type of vector elements (`Float32`, `Float64`, or `BFloat16`)
- `dimension`: The number of elements in the vector (fixed size)
Important: The wire format differs significantly between Native and RowBinary formats.
In RowBinary format, QBit is serialized exactly like an Array of the element type:
- Size (VarUInt): Number of elements in the vector (must match the declared dimension)
- Elements: Sequential float values in little-endian format
Example: QBit(Float32, 3) with values [1.0, 2.0, 3.0] in RowBinaryWithNamesAndTypes
curl -s "http://localhost:8123/?allow_experimental_qbit_type=1" \
--data-binary "SELECT [1.0, 2.0, 3.0]::QBit(Float32, 3) AS vec FORMAT RowBinaryWithNamesAndTypes" | xxd
00000000: 0103 7665 6310 5142 6974 2846 6c6f 6174 ..vec.QBit(Float
00000010: 3332 2c20 3329 0300 0080 3f00 0000 4000 32, 3)....?...@.
00000020: 0040 40 .@@
The data portion (after type metadata) is:
0x03 # Size = 3 (VarUInt)
0x00 0x00 0x80 0x3F # 1.0 as little-endian Float32
0x00 0x00 0x00 0x40 # 2.0 as little-endian Float32
0x00 0x00 0x40 0x40 # 3.0 as little-endian Float32
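The data portion above can be reproduced with a short sketch. The helper is hypothetical and assumes the dimension fits in a single VarUInt byte (dimension < 128).

```javascript
// Hypothetical sketch of the QBit RowBinary data portion: a one-byte
// VarUInt size followed by little-endian Float32 elements.
function encodeQBitRowBinary(values) {
  const buf = Buffer.alloc(1 + values.length * 4);
  buf[0] = values.length;                                    // Size (VarUInt, single byte here)
  values.forEach((v, i) => buf.writeFloatLE(v, 1 + i * 4));  // elements
  return buf;
}

console.log(encodeQBitRowBinary([1.0, 2.0, 3.0]).toString('hex'));
```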
In Native format, QBit uses a bit-transposed encoding that is NOT the same as Array. This format organizes data across multiple FixedString columns as a Tuple for SIMD-optimized distance calculations.
Example: QBit(Float32, 3) in Native format
curl -s "http://localhost:8123/?allow_experimental_qbit_type=1" \
--data-binary "SELECT [1.0, 2.0, 3.0]::QBit(Float32, 3) AS vec FORMAT Native" | xxd
00000000: 0101 0376 6563 1051 4269 7428 466c 6f61 ...vec.QBit(Floa
00000010: 7433 322c 2033 2900 0601 0101 0101 0101 t32, 3).........
00000020: 0400 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 00 .......
The bit transposition maps:
- Bit j of element i → Bit i of bit-plane j
This internal format is significantly more complex than simple array serialization and is designed for optimal SIMD performance during vector similarity computations.
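The transposition rule can be illustrated in isolation. The sketch below is not ClickHouse's implementation: it operates on small unsigned integer words instead of real Float32 bit patterns, and it does not model the actual QBit plane layout or padding; it only demonstrates the mapping "bit j of element i → bit i of bit-plane j".

```javascript
// Illustrative bit transposition: bit j of element i lands in bit i of plane j.
function transposeBits(words, bitsPerElement) {
  const planes = [];
  for (let j = 0; j < bitsPerElement; j++) {
    let plane = 0;
    words.forEach((w, i) => {
      if ((w >>> j) & 1) plane |= 1 << i; // set bit i of plane j
    });
    planes.push(plane);
  }
  return planes;
}

// 4-bit elements 0b0001 and 0b0010: plane 0 gets element 0's low bit,
// plane 1 gets element 1's second bit.
console.log(transposeBits([0b0001, 0b0010], 4)); // [1, 2, 0, 0]
```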
When using binary type encoding (with output_format_native_encode_types_in_binary_format=1):
0x36 # BinaryTypeIndex::QBit
<element_type_encoding> # Encoded element type (Float32=0x0D, Float64=0x0E, BFloat16=0x31)
<var_uint_dimension> # Vector dimension
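The binary type encoding above can be sketched as follows. The encoder is hypothetical; the index bytes (`0x36` for QBit and `0x0D`/`0x0E`/`0x31` for the element types) are taken from this document, and the VarUInt is encoded 7 bits per byte, least-significant group first.

```javascript
// Hypothetical encoder for the QBit binary type header shown above.
function encodeQBitType(elementType, dimension) {
  const elementIndex = { Float32: 0x0d, Float64: 0x0e, BFloat16: 0x31 }[elementType];
  const out = [0x36, elementIndex];   // BinaryTypeIndex::QBit + element type
  let d = dimension;
  do {                                // VarUInt: 7 bits per byte, LSB first
    out.push((d & 0x7f) | (d >= 0x80 ? 0x80 : 0));
    d = Math.floor(d / 128);
  } while (d > 0);
  return Uint8Array.from(out);
}

console.log(encodeQBitType('Float32', 3)); // bytes 0x36, 0x0d, 0x03
```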
The AggregateFunction type stores the intermediate state of an aggregate function. This enables incremental aggregation, state merging across distributed systems, and materialized aggregate views.
AggregateFunction(function_name[(param1, param2, ...)], arg_type1, arg_type2, ...)
Examples:
- `AggregateFunction(sum, UInt64)`: State for `sum` over `UInt64` values
- `AggregateFunction(quantile(0.5), Float64)`: State for median calculation
- `AggregateFunction(uniqExact, String)`: State for exact unique count
Starting from revision 54452 (DBMS_MIN_REVISION_WITH_AGGREGATE_FUNCTIONS_VERSIONING), aggregate function states include version information. The version number corresponds to the client_revision and allows backward-compatible changes to serialization formats.
Before revision 54452:
- States were serialized without version metadata
- Format changes required careful coordination
From revision 54452 onwards:
- Each state includes a version number
- Version is used during both serialization and deserialization
- Allows gradual rollout of format improvements
When serializing AggregateFunction values in Native format:
Type Name Format:
"AggregateFunction(function_name(param1, param2, ...), Type1, Type2, ...)"
Binary Type Encoding (when output_format_native_encode_types_in_binary_format=1):
0x25 # BinaryTypeIndex::AggregateFunction
<var_uint_version> # Version number (e.g., client revision)
<var_uint_name_size> # Length of function name
<function_name_data> # Function name as UTF-8 bytes
<var_uint_num_params> # Number of parameters
<param_1> # Parameter 1 (Field binary encoding)
...
<param_N> # Parameter N
<var_uint_num_args> # Number of argument types
<arg_type_1_encoding> # Argument type 1 (recursive type encoding)
...
<arg_type_N_encoding> # Argument type N
State Data Format:
Each aggregate function state is serialized as an opaque binary blob. The exact format is function-specific and controlled by the IAggregateFunction::serialize() and IAggregateFunction::deserialize() methods.
When writing a single value:
<state_data> # Function-specific binary data
The state data includes:
- Internal counters, accumulators, or intermediate values
- May include embedded data structures (e.g., hash tables for `uniq`, sketches for `quantile`)
- Format is optimized for the specific aggregate function
curl -s -XPOST "http://localhost:8123?default_format=Native" \
--data-binary "SELECT sumState(number) AS col FROM numbers(5)" | xxd
The result contains:
01 # NumColumns = 1
01 # NumRows = 1
03 63 6f 6c # Column name = "col" (length=3)
# Type name = "AggregateFunction(sum, UInt64)"
<type_string>
<state_data> # For sum(0,1,2,3,4) = 10, stored as UInt64
For sum over integers, the state is simply the accumulated value. For complex functions like quantile or uniq, the state contains sophisticated data structures.
When serializing multiple aggregate function states (e.g., in a column), the format uses serializeBatch:
<state_1_data>
<state_2_data>
...
<state_N_data>
Each state is serialized sequentially, with the aggregate function controlling the exact format through its versioned serialization methods.
- `sum`/`avg`: Simple accumulator values
- `uniq`: Hash table or HyperLogLog sketch
- `quantile`: T-Digest or other quantile estimation structure
- `groupArray`: Array of accumulated values
- `topK`: Heap of top K elements with counts
The Native format behavior can be customized using several format-specific settings. These settings affect both input parsing and output serialization.
Type: Bool
Default: true
Applies to: Input only
Allows automatic type conversion when reading Native format data. When enabled, ClickHouse will attempt to convert column types from the input data to match the target table schema.
Example Use Cases:
- Reading older data with slightly different type definitions
- Importing data where numeric types don't exactly match (e.g., `UInt32` to `UInt64`)
- Upgrading tables with changed column types
When disabled (false):
- Type mismatches cause errors
- Strict type checking enforced
- Better for ensuring data consistency
SET input_format_native_allow_types_conversion = 1;
INSERT INTO table SELECT * FROM input('Native', 'col UInt64');
Type: Bool
Default: false
Applies to: Input only
When enabled, expects type names to be encoded in binary format (as described in DataTypesBinaryEncoding.h) rather than as human-readable strings.
Binary format benefits:
- More compact representation (single byte for simple types)
- Faster parsing (no string parsing required)
- Unambiguous encoding
String format (default):
- Human-readable type names (e.g., `"UInt32"`, `"String"`)
- More compatible with external tools
- Easier debugging
Example:
-- Reading data with binary-encoded types
SET input_format_native_decode_types_in_binary_format = 1;
SELECT * FROM file('data.native', 'Native');
Type: Bool
Default: false
Applies to: Output only
When enabled, writes type names in binary format rather than as strings. This is the output counterpart to input_format_native_decode_types_in_binary_format.
Benefits of binary encoding:
- Reduced output size (especially for complex nested types)
- Faster serialization
- Precise encoding of all type parameters
Type encoding table: See DataTypesBinaryEncoding.h for the complete mapping (excerpted in section 8.1).
Example:
SET output_format_native_encode_types_in_binary_format = 1;
SELECT * FROM table FORMAT Native;
Binary vs String comparison:
- `UInt32`: Binary = 1 byte (`0x03`), String = 7 bytes (`"UInt32"` with its 1-byte length prefix)
- `Array(UInt8)`: Binary = 2 bytes (`0x1E 0x01`), String = 13 bytes
- Complex nested types: Binary can be many times smaller
Type: Bool
Default: false
Applies to: Output only, JSON data type
Controls how the new JSON data type is serialized in Native format.
When disabled (default false):
- `JSON` columns use their native binary serialization format
- Preserves full type structure with typed paths
- Supports efficient querying of JSON subcolumns
- Uses internal representation with dynamic types
When enabled (true):
- `JSON` columns are serialized as `String` columns
- Each value is a JSON string
- Simpler format, easier interoperability
- Loses type information for paths
Example:
CREATE TABLE json_test (data JSON) ENGINE = Memory;
INSERT INTO json_test VALUES ('{"id": 1, "name": "test"}');
-- Native JSON serialization (structured)
SET output_format_native_write_json_as_string = 0;
SELECT * FROM json_test FORMAT Native;
-- As string (JSON text)
SET output_format_native_write_json_as_string = 1;
SELECT * FROM json_test FORMAT Native;
Use when:
- Exporting to systems that don't understand ClickHouse's `JSON` type
- Preserving exact JSON text representation
- Simplifying data pipeline integration
Type: Bool
Default: false
Applies to: Output only, JSON and Dynamic data types
Controls whether JSON and Dynamic columns are serialized in a flattened format where all types/paths are written as separate subcolumns.
When disabled (default false):
- Uses compact serialization optimized for the Native format
- More efficient for storage and transmission
- Uses version 2 serialization (revision >= 54473)
When enabled (true):
- Each typed path in `JSON` becomes a separate subcolumn
- Each possible type in `Dynamic` is stored separately
- More verbose but easier to introspect
- Better compatibility with older versions or external tools
Flattened format structure:
For JSON:
- Each discovered path becomes a subcolumn (e.g., `data.id`, `data.name`)
- Shared offsets for nested structures
- Null maps for optional paths
For Dynamic:
- Discriminator column for type selection
- Separate column for each possible type variant
- Null maps for unused variants
Example:
CREATE TABLE dynamic_test (value Dynamic) ENGINE = Memory;
INSERT INTO dynamic_test VALUES (42), ('text'), ([1,2,3]);
-- Compact serialization
SET output_format_native_use_flattened_dynamic_and_json_serialization = 0;
SELECT * FROM dynamic_test FORMAT Native;
-- Flattened serialization
SET output_format_native_use_flattened_dynamic_and_json_serialization = 1;
SELECT * FROM dynamic_test FORMAT Native;
Trade-offs:
- Flattened: Easier to understand, more compatible, larger size
- Compact: More efficient, requires newer client support
The Native format behavior varies based on the protocol revision (also called client_revision in the code). This enables backward compatibility while allowing the format to evolve with new features.
For the request flow, exact source files, and the explorer's Native protocol presets, see Native Protocol Versions.
| Revision | Constant Name | Feature |
|---|---|---|
| 54405 | DBMS_MIN_REVISION_WITH_LOW_CARDINALITY_TYPE | LowCardinality support |
| 54452 | DBMS_MIN_REVISION_WITH_AGGREGATE_FUNCTIONS_VERSIONING | AggFunc versioning |
| 54454 | DBMS_MIN_REVISION_WITH_CUSTOM_SERIALIZATION | Custom serialization info |
| 54465 | DBMS_MIN_REVISION_WITH_SPARSE_SERIALIZATION | Sparse columns |
| 54473 | DBMS_MIN_REVISION_WITH_V2_DYNAMIC_AND_JSON_SERIALIZATION | Dynamic/JSON v2 |
| 54480 | DBMS_MIN_REVISION_WITH_OUT_OF_ORDER_BUCKETS_IN_AGGREGATION | Out of order bucket IDs |
| 54482 | DBMS_MIN_REVISION_WITH_REPLICATED_SERIALIZATION | Replicated columns |
| 54483 | DBMS_MIN_REVISION_WITH_NULLABLE_SPARSE_SERIALIZATION | Nullable sparse |
Feature: Introduces the LowCardinality type wrapper for efficient storage of low-cardinality string columns.
Wire Format Impact:
- Clients with `revision >= 54405`: Receive `LowCardinality(T)` types with dictionary-encoded values
- Clients with `revision < 54405`: Server automatically converts to the underlying type `T`
Backward Compatibility:
The server checks client_revision and calls recursiveRemoveLowCardinality() before sending data to old clients, ensuring they receive materialized values instead of dictionary references.
Example:
-- Server with LowCardinality column
CREATE TABLE test (country LowCardinality(String)) ENGINE = Memory;
-- Old client (revision < 54405): Receives as String
-- New client (revision >= 54405): Receives with dictionary encoding
Feature: Adds version metadata to AggregateFunction state serialization.
Wire Format Impact:
- Enables versioned serialization of aggregate function states
- Version number passed to `IAggregateFunction::serialize()` and `deserialize()`
- Allows incremental improvements to state formats
Before this revision:
- States serialized without version information
- Format changes required careful coordination across clusters
- Upgrades had to be all-or-nothing
After this revision:
- Each state includes version metadata
- Functions can support multiple format versions
- Gradual rollout of format improvements possible
Version Parameter:
The version is typically set to client_revision when serializing, allowing functions to choose appropriate format based on client capabilities.
See Section 9 for detailed AggregateFunction serialization format.
Feature: Enables custom serialization strategies based on column statistics.
Wire Format Impact:
- Server analyzes column data and derives `SerializationInfo`
- Serialization chosen based on actual data characteristics
- Examples: sparse serialization for columns with many default values
Process:
- Server calls `DataType::getSerializationInfo(column)` to analyze data
- Server calls `DataType::getSerialization(info)` to get optimized serialization
- Server writes serialization metadata if needed
- Client reads metadata and uses appropriate deserialization
Before this revision:
- All columns used default serialization
- No optimization based on data characteristics
After this revision:
- Sparse serialization for columns with mostly defaults
- Custom strategies for specific data patterns
- Significant storage and network savings for sparse data
Client Impact:
Old clients (revision < 54454) receive data with getDefaultSerialization() and all sparse columns are materialized.
Feature: Introduces sparse column format for efficient storage of columns with many default/zero values.
Wire Format Impact: Sparse columns are encoded as:
- Offsets array: Positions of non-default values
- Values array: Only non-default values
This dramatically reduces size for sparse data.
Example:
Regular column: [0, 0, 0, 0, 42, 0, 0, 0, 99, 0]
Sparse encoding:
offsets = [4, 8]
values = [42, 99]
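The idea can be sketched with a minimal helper. This is illustrative only: it collects positions and values of non-default entries, while the actual wire encoding of the offsets stream differs.

```javascript
// Minimal sketch of sparse encoding: record only non-default positions/values.
function toSparse(column, defaultValue = 0) {
  const offsets = [], values = [];
  column.forEach((v, i) => {
    if (v !== defaultValue) { offsets.push(i); values.push(v); }
  });
  return { offsets, values };
}

console.log(toSparse([0, 0, 0, 0, 42, 0, 0, 0, 99, 0]));
// offsets [4, 8], values [42, 99] — as in the example above
```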
Backward Compatibility:
- Clients with `revision < 54465`: Server calls `recursiveRemoveSparse()` to materialize all values
- Clients with `revision >= 54465`: Receive sparse format when beneficial
Applies to:
- Numeric columns with many zeros
- Nullable columns with many NULLs
- Any column where most values equal the default
Feature: Improved serialization format for Dynamic and JSON types.
Wire Format Impact:
- V1 (before 54473): Less efficient, simpler format
- V2 (from 54473): Optimized binary encoding, better compression
Settings Affected:
When client_revision < 54473, the server sets:
settings.dynamic_serialization_version = MergeTreeDynamicSerializationVersion::V1;
settings.object_serialization_version = MergeTreeObjectSerializationVersion::V1;
V2 Improvements:
- More compact type discriminators
- Better handling of mixed-type data
- Optimized path encoding for JSON
- Reduced metadata overhead
Backward Compatibility: Automatic downgrade to V1 for old clients, ensuring seamless upgrades.
Feature: Allows bucket IDs to arrive in any order during distributed aggregation.
Wire Format Impact: This primarily affects distributed aggregation protocol rather than basic Native format. When aggregating across multiple nodes:
Before 54480:
- Bucket IDs expected in sorted order
- Required sorting/buffering
After 54480:
- Buckets can arrive in any order
- Reduced latency in distributed queries
- Better parallelism
Native Format Impact: Minimal direct impact on basic Native format structure, but affects how aggregate states are transmitted during distributed two-level aggregation.
Feature: Support for replicated column optimization in distributed queries.
Wire Format Impact:
Replicated columns share data across multiple identical copies to save memory.
Before this revision:
- All replicated columns converted to full columns via `convertToFullColumnIfReplicated()`
- Increased memory usage
From this revision:
- Clients with `revision >= 54482`: Receive replicated columns as-is
- Clients with `revision < 54482`: Server materializes replicated columns
Use Case:
Common in JOIN operations where the same small dimension table is used repeatedly.
Feature: Extends sparse serialization to Nullable types.
Wire Format Impact: Enables sparse encoding for nullable columns, combining NULL map sparsity with value sparsity.
Efficiency Gains:
Example: Nullable(UInt64) with mostly NULLs and zeros
Regular: [NULL, NULL, 0, NULL, 42, NULL, 0, 0, NULL]
Sparse:
null_offsets = [2, 4, 6, 7] # non-NULL positions
null_values = [0, 42, 0, 0]
value_offsets = [1] # non-default positions within non-NULLs
value_values = [42]
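The two-level idea in the example above can be sketched as follows. The helper and its field names are hypothetical: NULLs are stripped first, then value-level sparsity is applied to the remaining values; the real wire encoding differs.

```javascript
// Illustrative two-level sparsity: NULL map sparsity, then value sparsity.
function sparseNullable(column, defaultValue = 0) { // null marks NULL
  const nullOffsets = [], nonNull = [];
  column.forEach((v, i) => {
    if (v !== null) { nullOffsets.push(i); nonNull.push(v); }
  });
  const valueOffsets = [], values = [];
  nonNull.forEach((v, i) => {
    if (v !== defaultValue) { valueOffsets.push(i); values.push(v); }
  });
  return { nullOffsets, nonNull, valueOffsets, values };
}

console.log(sparseNullable([null, null, 0, null, 42, null, 0, 0, null]));
// nullOffsets [2, 4, 6, 7], valueOffsets [1], values [42] — as above
```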
Backward Compatibility:
- `revision < 54483`: Server materializes sparse nullable columns
- `revision >= 54483`: Can receive optimized sparse format
The server determines serialization format based on client_revision:
if (client_revision >= DBMS_MIN_REVISION_WITH_CUSTOM_SERIALIZATION) {
// Use optimized serialization
if (client_revision < DBMS_MIN_REVISION_WITH_REPLICATED_SERIALIZATION)
column = column->convertToFullColumnIfReplicated();
if (client_revision < DBMS_MIN_REVISION_WITH_SPARSE_SERIALIZATION)
column = recursiveRemoveSparse(column);
// etc.
}
This cascading check ensures maximum compatibility while enabling new optimizations for capable clients.
As of the source code, the current TCP protocol version is 54483 (DBMS_TCP_PROTOCOL_VERSION), supporting all features listed above.