[Design] Accelerate json function with simdjson #4486
Replies: 10 comments 13 replies
-
CC: @Yuhta @laithsakka |
Beta Was this translation helpful? Give feedback.
-
I am wondering if we should have DOM element as the actual underlying value for JSON type in this case. Looks like we could lose a lot performance if we have to convert the DOM element back to string when there is a chain of these functions. @wanweiqiangintel How fast is simdjson on parsing the string into DOM? Do you have some profiling on this and compare it to the actual Presto simdjson functions above? If it is a small fraction of the Presto function we can keep using string. Otherwise we may want to use a custom type with JSON. |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel Would you clarify a bit what's not working? |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel Thank you for clarifying. Would you create a separate GitHub issue for this? We should fix it. CC: @laithsakka |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel Would it be possible to implement all JSON functions supported in Presto: https://prestodb.io/docs/current/functions/json.html ? |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel My reading of this section suggests that SIMD-based implementation is fully compatible with Presto, while current implementation is not. Is this the case? If so, let's clarify that SIMD-based implementation is fully compatible and file GitHub issues about current implementation not being compatible. |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel Would it be possible to add sections about benchmarking and testing. It would be nice to clarify how we are going to test the new implementation. Are we going to re-use existing tests and allow for running those twice: once using regular JSON functions and another time using SIMD-based functions? For benchmarks, it would be nice to explain which datasets we are planning to use and why these are representative. |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel Overall, this design looks good to me. WRT the implementation, I suggest to first create a PR that adds simdjson dependency and implements one function using it. Then, submit more PRs adding more functions (1-2 functions per PR if possible). |
Beta Was this translation helpful? Give feedback.
-
@wanweiqiangintel I'm also thinking that it would be good to have only one implementation in production code. I wonder if we could replace functions one by one and move existing implementations to the benchmark directory if it is needed for benchmarking. |
Beta Was this translation helpful? Give feedback.
-
FYI, in addition to the 7 functions listed above, there are also cast-to-JSON and cast-from-JSON functions in Velox. The implementation can be found in the JsonCastOperator class in JsonType.h/cpp. |
Beta Was this translation helpful? Give feedback.
-
JSON introduction
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
JSON is built on two structures:
JSON example:
'{"name":"John", "age":30, "books":[“A”,”B”,”C”]}'
JSON function in Velox
In the current Velox, 7 json related functions are supported as below. Functions 1~ 5 are implemented as simple functions. Functions 6~ 7 are implemented as vector functions. The input and output of those functions have been aligned with Presto json. But the behavior has some differences compared with Presto, details will be discussed in later chapters.
is_json_scalar
json_array_contains
json_array_length
json_extract_scalar
json_size
json_format
json_parse
Simdjson introduction:
The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.
Json function with simdjson in Velox
There are 7 JSON functions implemented with simdjson in Velox. The function
json_extract
is a new function that is very important in Presto. The input and output of the 7 functions have been aligned with Presto. While the output is json type, we use varchar to replace it temporarily because the built-in json type doesn’t support StringWriter method.is_json_scalar
json_array_contains
json_array_length
json_extract_scalar
json_parse
json_size
json_extract
Implementation Details
There are two methods of parsing json data in simdjson:
The simdjson community refers to "On Demand" as a front-end component since it is an interface between the low-level parsing functions and the user. It hides much of the complexity of parsing JSON documents. "On Demand" parse the document and find all the structural indexes(i.e. stage 1), without including building json type. If we need to find different values in the json object many times, using the dom parse method is better.
According to the interface of simdjson, we create a struct ParserContext.
The interface of those functions are in the file SIMDJsonFunctions.h.
SIMDIsJsonScalarFunction:
is_json_scalar(json) → boolean (Presto link)
SELECT is_json_scalar('1'); -- true
SELECT is_json_scalar('[1, 2, 3]'); -- false
json element value includes 6 different types,
array/object/number/string/boolean/null
. If the input json type belongs to one of the 4 typesnumber/string/Boolean/null
, the is_json_scalar(json) returns true, otherwise returns false.SIMDJsonArrayContainsFunction
json_array_contains(json, value) → boolean (Presto link)
SELECT json_array_contains('[1, 2, 3]', 2); -- true
Determine if a value exists in json (a string containing a JSON array). The type of a json value can be one of four types:
bool/int/double/varchar
.SIMDJsonArrayLengthFunction
json_array_length(json) → int64 (Presto link)
SELECT json_array_length('[1, 2, 3]'); -- 3
Returns the array length of json (a string containing a JSON array).
SIMDJsonExtractScalarFunction
json_extract_scalar(json, json_path) → varchar (Presto link)
Like json_extract(), but returns the result value as a string (as opposed to being encoded as JSON). The value referenced by json_path must be a scalar (boolean, number or string).
SIMDJsonParseFunction
json_parse(string) → varchar (original version is json in velox) (Presto link)
SELECT json_parse('[1, 2, 3]'); -- JSON '[1,2,3]'
SELECT json_parse('"abc"'); -- JSON '"abc"'
Returns the JSON value deserialized from the input JSON text. This is the inverse function to json_format(). Note: the original version of the return type is json. But the built-in json type doesn’t support the StringWriter method, we return varchar temporarily. Another method is adding the template specialization
struct resolver<CustomType<JSON>>
SIMDJsonExtractFunction
json_extract(json, json_path) → varchar (original version is json in velox) (Presto link)
SIMDJsonSizeFunction
json_size(json, json_path) → int64
Like json_extract(), but returns the size of the value. For objects or arrays, the size is the number of members, and the size of a scalar value is zero:
SELECT json_size('{"x": {"a": 1, "b": 2}}', '$.x'); -- 2
SELECT json_size('{"x": [1, 2, 3]}', '$.x'); -- 3
SELECT json_size('{"x": {"a": 1, "b": 2}}', '$.x.a'); -- 0
Benchmark
We have added two benchmarks about
json_extract
andjson_parse
to compare the performance between simdjson and folly implement.The benchmark implementations are in JsonExprBenchmark.cpp(expression mode) and JsonFunctionBenchmark.cpp(function mode).
The performance result is as below:
json_extract performance(expression mode):
json_extract performance(function mode):
json_parse performance(expression smode):
Detect
We have added a parameter in the function registerJsonFunctions(bool useSimd), users can assign useSimd=1 to enable simdjson function.
Semantic differences
In the overflow test of
json_extract_scalar
: TEST_F(JsonExtractScalarTest, overflow).jsonExtractScalar(R"(184467440737095516151844674407370955161518446744073709551615)","$")
The JsonExtractScalarFunction return
std::nullopt
,SIMDJsonExtractScalarFunction return
184467440737095516151844674407370955161518446744073709551615
,Presto return
184467440737095516151844674407370955161518446744073709551615
.In the wildcard test of
json_extract_scalar
: TEST_F(JsonExtractScalarTest, wildcardSelect)jsonExtractScalar(R"({"tags":{"a":["b"],"c":["d"]}})", "$.tags.c[*]")
The JsonExtractScalarFunction return
d
SIMDJsonExtractScalarFunction return
std::null
,The Presto return
std::null
.In the
jsonParse
test: TEST_F(JsonFunctionsTest, jsonParse).jsonParse(R"([1, 2, 3])")
The JsonParseFunction return
[1, 2, 3]
SIMDJsonExtractScalarFunction return
[1,2,3]
The Presto return
[1,2,3]
.Beta Was this translation helpful? Give feedback.
All reactions