Presto Java JSON functions support more malformed JSON than Velox #6597

kevinwilfong · 2023-09-15T21:29:17Z

kevinwilfong
Sep 15, 2023
Collaborator

Presto's JSON functions in Java don't seem to ever look ahead in the string beyond the bare minimum needed to evaluate the function, which means they can tolerate bad JSON.

E.g.
Incomplete values
JSON_EXTRACT_SCALAR('{"k1":"v1"', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b",', '$[1]')

Garbage data after the value
JSON_EXTRACT_SCALAR('{"k1":"v1"}abc', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"]xyz', '$[1]')

These all work fine in Presto, returning non-null results.
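
To make the "no look-ahead" behavior concrete, here is a minimal Python sketch of an extractor that stops reading as soon as it has the requested value. It is only an illustration of the idea, not Presto's or Velox's implementation: the function name `lazy_extract_top_level` is hypothetical, and it only handles a single top-level key or index.

```python
import json

_decoder = json.JSONDecoder()
_WHITESPACE = " \t\r\n"


def _skip_ws(doc, i):
    """Advance past JSON whitespace starting at index i."""
    while i < len(doc) and doc[i] in _WHITESPACE:
        i += 1
    return i


def lazy_extract_top_level(doc, key):
    """Hypothetical helper: return the value at a top-level key (str) or
    array index (int), reading no further than that value.
    Returns None if the value cannot be reached."""
    i = _skip_ws(doc, 0)
    if i >= len(doc) or doc[i] not in "{[":
        return None
    is_object = doc[i] == "{"
    if is_object != isinstance(key, str):
        return None  # a key needs an object, an index needs an array
    i += 1
    position = 0
    try:
        while True:
            i = _skip_ws(doc, i)
            name = None
            if is_object:
                name, i = _decoder.raw_decode(doc, i)
                i = _skip_ws(doc, i)
                if doc[i] != ":":
                    return None
                i = _skip_ws(doc, i + 1)
            value, i = _decoder.raw_decode(doc, i)
            if (is_object and name == key) or (not is_object and position == key):
                return value  # stop here; the rest of doc is never inspected
            position += 1
            i = _skip_ws(doc, i)
            if doc[i] != ",":
                return None
            i += 1
    except (ValueError, IndexError):
        # ValueError covers json.JSONDecodeError; IndexError means we ran off the end
        return None
```

With this sketch, `lazy_extract_top_level('{"k1":"v1"', 'k1')` and `lazy_extract_top_level('["a", "b"]xyz', 1)` both return a value, because the missing closing brace and the trailing garbage are never read.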

Velox doesn't support any of these cases today in the Presto JSON UDFs and returns NULL for all of them. This is true for both the simdjson-based and folly-based versions of the UDFs.

All of the cases I could find where the simdjson and Presto versions of the UDFs differed were due to the fact that simdjson will check that the root object/array ends with the correct closing brace (this is easy/efficient to check, just peek at the last character). Since the simdjson UDFs lazily parse the JSON (unlike the folly versions) other errors later in the string seem to be ignored, e.g.

Incomplete inner values
JSON_EXTRACT_SCALAR('{"k1":{"k1":"v1",}', '$.k1.k1')
JSON_EXTRACT_SCALAR('[["a", "b",]', '$[0][1]')

Garbage data in the value
JSON_EXTRACT_SCALAR('{"k1":"v1"abc}', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"xyz]', '$[1]')

These all return non-null results in both Presto and Velox.
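
The final-character check described above can be sketched in a few lines of Python. This is an illustration of the idea only, not simdjson's code: the name `root_is_closed` is hypothetical, and ignoring surrounding whitespace is an assumption.

```python
def root_is_closed(doc):
    """Hypothetical check: does the document start with '{' or '[' and end
    with the matching closing character? Only object/array roots are covered."""
    stripped = doc.strip()  # assumption: surrounding whitespace is ignored
    if len(stripped) < 2:
        return False
    closers = {"{": "}", "[": "]"}
    return closers.get(stripped[0]) == stripped[-1]


# Cases from this thread that fail the check (NULL in Velox).
rejected = ['{"k1":"v1"', '["a", "b",', '{"k1":"v1"}abc', '["a", "b"]xyz']

# Cases that pass the check even though the JSON is malformed inside.
accepted = ['{"k1":{"k1":"v1",}', '[["a", "b",]', '{"k1":"v1"abc}', '["a", "b"xyz]']
```

The check separates the two groups of examples exactly: the first four fail it, while the four "inner" cases pass because their first and last characters match.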

1. Does Presto want to continue supporting these cases? I assume yes, since dropping support would mean a change in behavior for customers, but given the JSON is malformed I figured it's worth asking. Adding the check for the final character to match simdjson would be easy.
2. We can try to get a change accepted to simdjson to make that final character check configurable, but simdjson is not very configurable today and, with the exception of some buffer sizes, appears to be configurable only through compile-time flags.
3. If none of the above work, we may need to find a more lenient JSON parsing library or implement our own. Achieving exactly the same semantics as Presto in the face of malformed JSON may be a difficult balancing act.

spershin · 2023-09-15T22:11:56Z

spershin
Sep 15, 2023
Collaborator

My IMHO:

1. The best option.
2. The plan B.
3. "No man's land."

0 replies

mbasmanova · 2023-09-16T07:47:47Z

mbasmanova
Sep 16, 2023
Collaborator

@kevinwilfong Kevin, thank you for describing this issue. Would it be possible to open a GitHub issue in the simdjson project to ask whether they would be willing to make a change (2)?

1 reply

kevinwilfong Sep 19, 2023
Collaborator Author

There was a similar issue raised about 6 months back.
simdjson/simdjson#1976

They didn't express any interest in fixing it, but said they'd consider contributions.

mbasmanova · 2023-09-16T07:48:16Z

mbasmanova
Sep 16, 2023
Collaborator

CC: @wanweiqiangintel

0 replies

mbasmanova · 2023-09-16T07:49:23Z

mbasmanova
Sep 16, 2023
Collaborator

CC: @rui-mo @PHILO-HE Folks, what is the behavior of Spark?

4 replies

PHILO-HE Sep 18, 2023

Hi @mbasmanova, my test shows Spark doesn't behave the same as either Presto or Velox.

Incomplete values
JSON_EXTRACT_SCALAR('{"k1":"v1"', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b",', '$[1]')

Spark produces NULL for the above cases (same as Velox, but different from Presto).

Garbage data after the value
JSON_EXTRACT_SCALAR('{"k1":"v1"}abc', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"]xyz', '$[1]')

Spark produces non-NULL for the above cases (same as Presto, but different from Velox).

Incomplete inner values
JSON_EXTRACT_SCALAR('{"k1":{"k1":"v1",}', '$.k1.k1')
JSON_EXTRACT_SCALAR('[["a", "b",]', '$[0][1]')

Garbage data in the value
JSON_EXTRACT_SCALAR('{"k1":"v1"abc}', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"xyz]', '$[1]')

Spark produces NULL for the above cases (different from Presto/Velox).

mbasmanova Sep 18, 2023
Collaborator

@PHILO-HE Thank you for checking and sharing Spark's behavior. Do you happen to know if this behavior in Spark is intentional and the same in latest Spark or just "by chance"? What JSON parsing library is used in Spark?

PHILO-HE Sep 18, 2023

@mbasmanova, Spark uses Jackson to parse JSON. The behavior described above is unchanged in the latest Spark.
I think the behavior on malformed JSON is NOT intentionally designed by Spark. It's just how the third-party library handles it.

It's definitely hard to fix all the inconsistent cases. I once added some code to check whether the adjacent closing symbol is illegal when extracting an element with simdjson, but that fixes only a few cases.

mbasmanova Sep 18, 2023
Collaborator

Got it. Presto uses Jayway: prestodb/presto#18025

mbasmanova · 2023-09-16T07:50:31Z

mbasmanova
Sep 16, 2023
Collaborator

@kevinwilfong Kevin, would you open an issue in Presto repo to ask whether this is a desired behavior and if so to update the documentation to make this clearer?

1 reply

kevinwilfong Sep 19, 2023
Collaborator Author

prestodb/presto#20916

EpsilonPrime · 2023-09-19T16:17:16Z

EpsilonPrime
Sep 19, 2023

If the end goal is for the query to behave the same whether it is sent to Presto, Spark, or Velox then an option might be to rewrite the query from:

JSON_EXTRACT_SCALAR(data, specifier)

to

IF(JSON_IS_VALID(data), JSON_EXTRACT_SCALAR(data, specifier), NULL)

I know there isn't a JSON_IS_VALID function on any of these systems but it can either be added or replaced with REGEXP_MATCH with the appropriate arguments.
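
The guard pattern above can be sketched in Python. This is an illustration only: `json_is_valid` and `guarded_extract` are hypothetical names, and Python's `json.loads` stands in for whatever strict validator an engine would use (note that by default it also accepts `NaN` and `Infinity`, which strict JSON does not).

```python
import json


def json_is_valid(doc):
    """Hypothetical stand-in for JSON_IS_VALID: True only if the whole
    string parses as JSON."""
    try:
        json.loads(doc)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def guarded_extract(doc, extract):
    """Mirror of IF(JSON_IS_VALID(data), JSON_EXTRACT_SCALAR(...), NULL):
    run the extractor only on fully valid input, otherwise return None."""
    return extract(doc) if json_is_valid(doc) else None


def get_k1(doc):
    """Example extractor for illustration: read the top-level key 'k1'."""
    return json.loads(doc)["k1"]
```

Because validation reads the whole string, every malformed example in this thread yields `None` regardless of how lenient the underlying extractor is, at the cost of parsing each document twice.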

0 replies
