diff --git a/docs/data-operate/import/file-format/json.md b/docs/data-operate/import/file-format/json.md index 5dc545f65ebbc..8c737cb2c7b7a 100644 --- a/docs/data-operate/import/file-format/json.md +++ b/docs/data-operate/import/file-format/json.md @@ -94,7 +94,7 @@ The following table lists the JSON format parameters supported by various loadin 2. Broker Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 3. Routine Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 4. TVF: Parameters are specified in TVF statements, e.g., `S3("jsonpaths"="$.data")` -5. If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as $., e.g., `PROPERTIES("jsonpaths"="$.")` +5. If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as `$.` or `$`, e.g., `PROPERTIES("jsonpaths"="$.")` 6. The default value of read_json_by_line is true, which means if neither strip_outer_array nor read_json_by_line is specified during import, read_json_by_line will be set to true. 7. "read_json_by_line not configurable" means it is forcibly set to true to enable streaming reading and reduce BE memory usage. ::: @@ -186,154 +186,95 @@ The following table lists the JSON format parameters supported by various loadin -- Set num_as_string=true, price field will be parsed as string ``` -### Relationship between JSON Path and Columns +## Usage Examples -During data loading, JSON Path and Columns serve different responsibilities: +This section demonstrates how to use JSON format with different loading methods, and explains the parameters required for various JSON formats (using Stream Load as an example). 
-**JSON Path**: Defines data extraction rules - - Extracts fields from JSON data according to specified paths - - Extracted fields are reordered according to the order defined in JSON Path +### Parameter Usage Guide -**Columns**: Defines data mapping rules - - Maps extracted fields to target table columns - - Can perform column reordering and transformation +#### JSON Format Parameters -These two parameters are processed serially: first, JSON Path extracts fields from source data and forms an ordered dataset, then Columns maps these data to table columns. If Columns is not specified, extracted fields will be mapped directly according to table column order. +For different JSON file formats, there are two important parameters that control how data is read during import: -#### Usage Examples +- `strip_outer_array` +- `read_json_by_line` -##### Using JSON Path Only +**Example 1: One JSON Record per Line** -Table structure and data: -```sql --- Table structure -CREATE TABLE example_table ( - k2 int, - k1 int -); +Each line contains a complete JSON record and is imported as a stream. When users don't specify values for these two parameters, the default settings are `read_json_by_line=true` and `strip_outer_array=false`. Therefore, users don't need to specify these parameters for this JSON format (although explicitly setting `read_json_by_line=true` is also acceptable). --- JSON data -{"k1": 1, "k2": 2} +```JSON +{"a": 1, "b": 11} +{"a": 2, "b": 12} +{"a": 3, "b": 13} +{"a": 4, "b": 14} ``` -Load command: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load -``` +If you mistakenly set `strip_outer_array` to true, you will see an error message in `FirstErrorMsg` like: `JSON data is not an array-object, strip_outer_array must be FALSE`. 
-Load result: -```text -+------+------+ -| k1 | k2 | -+------+------+ -| 2 | 1 | -+------+------+ -``` +--- -##### Using JSON Path + Columns +**Example 2: Array-Format JSON Records** -Using the same table structure and data, adding columns parameter: +When JSON records are organized as an array in the file, you need to set `strip_outer_array=true`. -Load command: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -H "columns: k2, k1" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load +```JSON +[ + {"a": 1, "b": 11}, + {"a": 2, "b": 12} +] ``` -Load result: -```text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | 2 | -+------+------+ -``` +If you mistakenly set `read_json_by_line` to true, you will see an error message in `FirstErrorMsg` like: `Parse json data failed. code: 28...`. -##### Field Reuse +--- -Table structure and data: -```sql --- Table structure -CREATE TABLE example_table ( - k2 int, - k1 int, - k1_copy int -); +**Example 3: Multiple JSON Records per Line** --- JSON data -{"k1": 1, "k2": 2} -``` +If each line contains an array with multiple JSON records, you need to explicitly set both parameters to true. -Load command: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k1\"]" \ - -H "columns: k2, k1, k1_copy" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load +```JSON +[{"a": 1, "b": 11},{"a": 2, "b": 12}] +[{"a": 3, "b": 13},{"a": 4, "b": 14}] ``` -Load result: -```text -+------+------+---------+ -| k2 | k1 | k1_copy | -+------+------+---------+ -| 2 | 1 | 1 | -+------+------+---------+ -``` +If you forget to set `strip_outer_array`, you will see an error message in `FirstErrorMsg` like: `JSON data is array-object, strip_outer_array must be TRUE`. If you forget to set `read_json_by_line`, **only the first line** (the two JSON records on the first line) will be imported. Please be aware of this behavior. 
+ +#### JSON Path Related Parameters -##### Nested Field Mapping +During JSON import, you can configure `jsonpaths` and `json_root` to have more flexible control over data extraction paths, which provides support for importing complex nested JSON formats. Another related parameter is `columns`. -Table structure and data: ```sql -- Table structure CREATE TABLE example_table ( - k2 int, - k1 int, - k1_nested1 int, - k1_nested2 int -); + a INT, + b INT +) -- JSON data -{ - "k1": 1, - "k2": 2, - "k3": { - "k1": 31, - "k1_nested": { - "k1": 32 - } - } -} +[ + {"id":1, "record":{"year":25, "name":"hiki"}}, + {"id":2, "record":{"year":20, "name":"ykk"}} +] ``` -Load command: ```shell curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k3.k1\", \"$.k3.k1_nested.k1\"]" \ - -H "columns: k2, k1, k1_nested1, k1_nested2" \ + -H "jsonpaths:[\"$.id\",\"$.record.year\"]" \ + -H "columns:b,a" \ -T example.json \ http://:/api/db_name/table_name/_stream_load -``` -Load result: -```text -+------+------+------------+------------+ -| k2 | k1 | k1_nested1 | k1_nested2 | -+------+------+------------+------------+ -| 2 | 1 | 31 | 32 | -+------+------+------------+------------+ +select * from example_table; ++------+------+ +| a | b | ++------+------+ +| 20 | 2 | +| 25 | 1 | ++------+------+ ``` -## Usage Examples - -This section demonstrates the usage of JSON format in different loading methods. +From the import command and the imported data above, we can clearly see how these parameters work together. You can think of JSON import as a two-step process: first, data is read from the JSON file and organized into an array of row data, then each row is imported into the table one by one. The order of data in this array (from JSON file to row array) is controlled by `jsonpaths` and `json_root`. In the example above, the data order for each row in the array is `id` and `year` from the JSON file. 
The mapping relationship between each row's data and the table columns is specified by `columns`. In the example above, the `id` data is imported into column `b`, and the `year` data is imported into column `a`. ### Stream Load diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/file-format/json.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/file-format/json.md index ddaa2af70eaa5..d3e3f8e5de673 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/file-format/json.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/file-format/json.md @@ -95,7 +95,7 @@ Doris 支持以下三种 JSON 格式: 2. Broker Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 3. Routine Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 4. TVF:参数通过 TVF 语句指定,如:`S3("jsonpaths"="$.data")` -5. 如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为$.,如:`PROPERTIES("jsonpaths"="$.")` +5. 如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为`$.`或者`$`,如:`PROPERTIES("jsonpaths"="$.")` 6. read_json_by_line默认为true指的是如果导入时不指定strip_outer_array和read_json_by_line任何一个, 那么read_json_by_line为true. 7. 
read_json_by_line不支持配置指强制设置为true, 开启流式读取降低BE内存压力 ::: @@ -188,155 +188,95 @@ Doris 支持以下三种 JSON 格式: ``` -### JSON Path 和 Columns 的关系 +## 使用示例 -在数据导入过程中,JSON Path 和 Columns 各自承担不同的职责: +本节展示了不同导入方式下的 JSON 格式使用方法, 以及各种JSON格式下导入需要指定的参数(以stream load为例)。 -**JSON Path**:定义数据抽取规则 - - 从 JSON 数据中按指定路径抽取字段 - - 抽取的字段按 JSON Path 中定义的顺序进行重排列 +### 参数使用说明 -**Columns**:定义数据映射规则 - - 将抽取的字段映射到目标表的列 - - 可以进行列的重排和转换 +#### JSON格式相关参数 -这两个参数的处理过程是串行的:首先 JSON Path 从源数据中抽取字段并形成有序的数据集,然后 Columns 将这些数据映射到表的列中。如果不指定 Columns,抽取的字段将按照表的列顺序直接映射。 +对于不同格式的JSON文件,控制导入时候读取方式的两个比较重要的参数: -#### 使用示例 +- `strip_outer_array` +- `read_json_by_line` -##### 仅使用 JSON Path +**example1:多行JSON记录** -表结构和数据: -```sql --- 表结构 -CREATE TABLE example_table ( - k2 int, - k1 int -); +每行作为一个完整的JSON记录流式地导入,当用户未指定这两个参数的值时候,默认设置`read_json_by_line`为true,`strip_outer_array`为false,因此这种格式的JSON用户不需要对这两个参数做指定(显示指定`read_json_by_line`也是可以的) --- JSON 数据 -{"k1": 1, "k2": 2} +```JSON +{"a": 1, "b": 11} +{"a": 2, "b": 12} +{"a": 3, "b": 13} +{"a": 4, "b": 14} ``` -导入命令: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load - -``` +如果用户误设置`strip_outer_array`为true,会在`FirstErrorMsg`中看到`JSON data is not an array-object, strip_outer_array must be FALSE`这样的报错信息。 -导入结果: -```text -+------+------+ -| k1 | k2 | -+------+------+ -| 2 | 1 | -+------+------+ -``` +--- -##### 使用 JSON Path + Columns +**example2:数组形式的JSON记录** -使用相同的表结构和数据,添加 columns 参数: +文件中JSON记录以数组的形式组织,需要用户指定`strip_outer_array`为true。 -导入命令: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -H "columns: k2, k1" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load +```JSON +[ + {"a": 1, "b": 11}, + {"a": 2, "b": 12} +] ``` -导入结果: -```text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | 2 | -+------+------+ -``` +如果用户误设置`read_json_by_line`为true,会在`FirstErrorMsg`中看到`Parse json data failed. 
code: 28...`这样的报错信息。 -##### 字段重复使用 +--- -表结构和数据: -```sql --- 表结构 -CREATE TABLE example_table ( - k2 int, - k1 int, - k1_copy int -); +**example3:多行数组** --- JSON 数据 -{"k1": 1, "k2": 2} -``` +如果每一行以数组的形式记录多个JSON记录,需要用户显式的将两个参数都设置为true。 -导入命令: -```shell -curl -v ... -H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k1\"]" \ - -H "columns: k2, k1, k1_copy" \ - -T example.json \ - http://:/api/db_name/table_name/_stream_load +```JSON +[{"a": 1, "b": 11},{"a": 2, "b": 12}] +[{"a": 3, "b": 13},{"a": 4, "b": 14}] ``` -导入结果: -```text -+------+------+---------+ -| k2 | k1 | k1_copy | -+------+------+---------+ -| 2 | 1 | 1 | -+------+------+---------+ -``` +如果用户少设置了`strip_outer_array`,会在`FirstErrorMsg`中看到`JSON data is array-object, strip_outer_array must be TRUE`这样的报错信息。而如果用户少设置了`read_json_by_line`,这里**只会导入第一行**的两个JSON记录,需要注意一下。 + +#### JSON path相关参数 -##### 嵌套字段映射 +在JSON导入的过程中,用户可以通过设置`jsonpaths`和`json_root`,来更自由地控制数据抽取的路径,以提供对嵌套JSON的一些复杂格式的导入支持。与之相关的另一个参数项是`columns`。 -表结构和数据: ```sql -- 表结构 CREATE TABLE example_table ( - k2 int, - k1 int, - k1_nested1 int, - k1_nested2 int -); + a INT, + b INT +) -- JSON 数据 -{ - "k1": 1, - "k2": 2, - "k3": { - "k1": 31, - "k1_nested": { - "k1": 32 - } - } -} +[ + {"id":1, "record":{"year":25, "name":"hiki"}}, + {"id":2, "record":{"year":20, "name":"ykk"}} +] ``` -导入命令: ```shell curl -v ... 
-H "format: json" \ - -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k3.k1\", \"$.k3.k1_nested.k1\"]" \ - -H "columns: k2, k1, k1_nested1, k1_nested2" \ + -H "jsonpaths:[\"$.id\",\"$.record.year\"]" \ + -H "columns:b,a" \ -T example.json \ http://:/api/db_name/table_name/_stream_load -``` -导入结果: -```text -+------+------+------------+------------+ -| k2 | k1 | k1_nested1 | k1_nested2 | -+------+------+------------+------------+ -| 2 | 1 | 31 | 32 | -+------+------+------------+------------+ +select * from example_table; ++------+------+ +| a | b | ++------+------+ +| 20 | 2 | +| 25 | 1 | ++------+------+ ``` -## 使用示例 - -本节展示了不同导入方式下的 JSON 格式使用方法。 +从上面这样的导入命令和导入后数据不难看出这几个参数的交互。可以假设JSON导入是先从JSON文件中把数据读取出来,组织成一个行数据数组,然后将每一行数据逐个导入到表中。其中从JSON文件到行数据数组的数据分布顺序就由`jsonpaths`和`json_root`来控制,上面的例子中,该数组每一行数据分布的顺序就是JSON文件中的`id`和`year`,将每一行的数据导入到表中各项数据的映射关系,则由`columns`来指定,上面的例子中将`id`的数据导入到`b`表项,将`year`的数据导入到`a`表项。 ### Stream Load 导入 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/file-format/json.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/file-format/json.md index 95ee337ff4c02..6a3cde41a54ed 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/file-format/json.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/file-format/json.md @@ -94,7 +94,7 @@ Doris 支持以下三种 JSON 格式: 2. Broker Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 3. Routine Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 4. TVF:参数通过 TVF 语句指定,如:`S3("jsonpaths"="$.data")` -5. 如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为$.,如:`PROPERTIES("jsonpaths"="$.")` +5. 
如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为`$.`或者`$`,如:`PROPERTIES("jsonpaths"="$.")` ::: ### 参数说明 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/file-format/json.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/file-format/json.md index 646f4c80e0823..8212a7117d79b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/file-format/json.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/file-format/json.md @@ -94,7 +94,7 @@ Doris 支持以下三种 JSON 格式: 2. Broker Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 3. Routine Load:参数通过 `PROPERTIES` 指定,如:`PROPERTIES("jsonpaths"="$.data")` 4. TVF:参数通过 TVF 语句指定,如:`S3("jsonpaths"="$.data")` -5. 如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为$.,如:`PROPERTIES("jsonpaths"="$.")` +5. 如果需要将 JSON 文件中根节点的 JSON 对象导入,jsonpaths 需要指定为`$.`或者`$`,如:`PROPERTIES("jsonpaths"="$.")` ::: ### 参数说明 diff --git a/versioned_docs/version-2.1/data-operate/import/file-format/json.md b/versioned_docs/version-2.1/data-operate/import/file-format/json.md index fe7c788f3a420..846bd47b35d71 100644 --- a/versioned_docs/version-2.1/data-operate/import/file-format/json.md +++ b/versioned_docs/version-2.1/data-operate/import/file-format/json.md @@ -94,7 +94,7 @@ The following table lists the JSON format parameters supported by various loadin 2. Broker Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 3. Routine Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 4. TVF: Parameters are specified in TVF statements, e.g., `S3("jsonpaths"="$.data")` -5. If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as $., e.g., PROPERTIES("jsonpaths"="$.") +5. 
If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as `$.` or `$`, e.g., `PROPERTIES("jsonpaths"="$.")` ::: ### Parameter Description diff --git a/versioned_docs/version-3.0/data-operate/import/file-format/json.md b/versioned_docs/version-3.0/data-operate/import/file-format/json.md index a8dcef1d9c12d..355cd70f5654d 100644 --- a/versioned_docs/version-3.0/data-operate/import/file-format/json.md +++ b/versioned_docs/version-3.0/data-operate/import/file-format/json.md @@ -94,7 +94,7 @@ The following table lists the JSON format parameters supported by various loadin 2. Broker Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 3. Routine Load: Parameters are specified through `PROPERTIES`, e.g., `PROPERTIES("jsonpaths"="$.data")` 4. TVF: Parameters are specified in TVF statements, e.g., `S3("jsonpaths"="$.data")` -5. If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as $., e.g., PROPERTIES("jsonpaths"="$.") +5. If you need to load the JSON object at the root node of a JSON file, the jsonpaths should be specified as `$.` or `$`, e.g., `PROPERTIES("jsonpaths"="$.")` ::: ### Parameter Description
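
Reviewer note: the two-step model described in the new "JSON Path Related Parameters" text (extract values in `jsonpaths` order, then map them positionally onto `columns`) can be sketched as a toy simulation. This is only an illustration in Python, not how Doris implements loading; the helper names `extract` and `simulate_load` are hypothetical, and the path handling covers only simple dotted paths such as `$.record.year`, plus `$` and `$.` for the root.

```python
import json


def extract(record, path):
    """Resolve a simple dotted path like "$.record.year" against a dict.

    Hypothetical helper: "$" and "$." both resolve to the record itself.
    """
    node = record
    for key in path.lstrip("$").strip(".").split("."):
        if key:  # an empty key means the path pointed at the root
            node = node[key]
    return node


def simulate_load(records, jsonpaths, columns):
    """Toy model of the two steps described in the docs (assumption, not Doris code)."""
    # Step 1: jsonpaths decides which values are extracted and in what order.
    rows = [[extract(record, path) for path in jsonpaths] for record in records]
    # Step 2: columns maps each extracted value, by position, to a table column.
    return [dict(zip(columns, row)) for row in rows]


# Same sample data as the example in the diff.
data = json.loads(
    '[{"id":1, "record":{"year":25, "name":"hiki"}},'
    ' {"id":2, "record":{"year":20, "name":"ykk"}}]'
)
table = simulate_load(data, ["$.id", "$.record.year"], ["b", "a"])
for row in table:
    print(row)
```

With the sample data from the diff, `id` lands in column `b` and `year` in column `a`, which matches the query result shown in the new documentation.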