Presto Java JSON functions support more malformed JSON than Velox #6597
-
IMHO:
-
@kevinwilfong Kevin, thank you for describing this issue. Would it be possible to open a GitHub issue in the simdjson project to ask whether they would be willing to make change (2)?
-
@kevinwilfong Kevin, would you open an issue in the Presto repo to ask whether this is desired behavior and, if so, to update the documentation to make this clearer?
-
If the end goal is for the query to behave the same whether it is sent to Presto, Spark, or Velox then an option might be to rewrite the query from:
to
I know there isn't a JSON_IS_VALID function on any of these systems, but one could either be added or be replaced with REGEXP_MATCH with the appropriate arguments.
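A minimal sketch of the semantics such a guard would have, written in Python rather than SQL. Everything here is an assumption for illustration: `json_is_valid` and `guarded_lookup` are hypothetical names, Python's `json` module stands in for a strict validator, and the lookup handles only a top-level key rather than a full JSON path.

```python
import json
from typing import Any, Optional


def json_is_valid(doc: str) -> bool:
    """Hypothetical JSON_IS_VALID: True only if the whole string parses."""
    try:
        json.loads(doc)  # rejects truncated input and trailing garbage
    except ValueError:   # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def guarded_lookup(doc: str, key: str) -> Optional[Any]:
    """Return None (SQL NULL) for any malformed document, else the value."""
    if not json_is_valid(doc):
        return None
    value = json.loads(doc)
    return value.get(key) if isinstance(value, dict) else None


for doc in ['{"k1":"v1"}', '{"k1":"v1"', '{"k1":"v1"}abc', '{"k1":"v1"abc}']:
    print(repr(doc), "->", guarded_lookup(doc, "k1"))
```

With a guard like this, every engine would return NULL for all of the malformed inputs, at the cost of validating the full document on each call.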
-
Presto's JSON functions in Java never seem to look ahead in the string beyond the bare minimum they need to evaluate the function, which means they can tolerate malformed JSON.
E.g.
Incomplete values
JSON_EXTRACT_SCALAR('{"k1":"v1"', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b",', '$[1]')
Garbage data after the value
JSON_EXTRACT_SCALAR('{"k1":"v1"}abc', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"]xyz', '$[1]')
These all work fine in Presto, returning non-null results.
Velox doesn't support any of these cases today in the Presto JSON UDFs and will return NULL. This is true for both the simdjson and folly-based versions of the UDFs.
All of the cases I could find where the simdjson and Presto versions of the UDFs differed were due to the fact that simdjson checks that the root object/array ends with the correct closing brace or bracket (this is easy and efficient to check: just peek at the last character). Since the simdjson UDFs parse the JSON lazily (unlike the folly versions), other errors later in the string seem to be ignored, e.g.
Incomplete inner values
JSON_EXTRACT_SCALAR('{"k1":{"k1":"v1",}', '$.k1.k1')
JSON_EXTRACT_SCALAR('[["a", "b",]', '$[0][1]')
Garbage data in the value
JSON_EXTRACT_SCALAR('{"k1":"v1"abc}', '$.k1')
JSON_EXTRACT_SCALAR('["a", "b"xyz]', '$[1]')
These all return non-null results in both Presto and Velox.
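To make the split between the two groups of examples concrete, here is a rough Python sketch of the "peek at the last character" check described above. It is a toy model of that one check, not simdjson's actual implementation, and `root_is_closed` is a hypothetical helper name.

```python
def root_is_closed(doc: str) -> bool:
    """Toy check: does the root object/array end with its matching closer?"""
    text = doc.strip()
    if not text:
        return False
    closers = {"{": "}", "[": "]"}
    expected = closers.get(text[0])  # None if the root is not an object/array
    return expected is not None and text[-1] == expected


# Inputs from the first group (Presto returns a value, Velox returns NULL).
rejected = ['{"k1":"v1"', '["a", "b",', '{"k1":"v1"}abc', '["a", "b"]xyz']
# Inputs from the second group (non-null in both Presto and Velox).
accepted = ['{"k1":{"k1":"v1",}', '[["a", "b",]', '{"k1":"v1"abc}', '["a", "b"xyz]']

for doc in rejected + accepted:
    print(repr(doc), "->", root_is_closed(doc))
```

Under this model the first four inputs fail the check and the last four pass it, which matches the observed behavior: once the root closer looks right, a lazy parser that stops as soon as it finds the requested value never reaches the later errors.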