[CH] simdjson doesn't support strings which contain invalid unicodes #7849
Comments
@lgbo-ustc, in Velox, we made simdjson skip UTF-8 validation by setting
Is this setting applicable to your issue?
Thanks, @PHILO-HE. We tried this option, but it doesn't work. We printed the call stack as follows
And found that we must let
    simdjson_warn_unused simdjson_inline error_code tape_builder::visit_string(json_iterator &iter, const uint8_t *value, bool key) noexcept {
      iter.log_value(key ? "key" : "string");
      uint8_t *dst = on_start_string(iter);
      dst = stringparsing::parse_string(value+1, dst, false); // We do not allow replacement when the escape characters are invalid.
      std::cout << "xxx dist is null " << (dst == nullptr) << "\n";
      if (dst == nullptr) {
        iter.log_error("Invalid escape in string");
        return STRING_ERROR;
      }
      on_end_string(dst);
      return SUCCESS;
    }
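To make the role of that `false` argument concrete, here is a minimal, self-contained sketch of the two possible policies for an unpaired surrogate escape such as `\udee4`: reject the string, or substitute U+FFFD. This is not simdjson's code. The names `parse_hex4` and `decode_unicode_escape` are hypothetical, and it assumes that "replacement" means emitting the Unicode replacement character U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a simdjson function): read four hex digits
// starting at s[pos] and store their value in `out`.
bool parse_hex4(const std::string &s, std::size_t pos, uint32_t &out) {
  if (pos + 4 > s.size()) return false;
  uint32_t v = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    const char c = s[pos + i];
    uint32_t d;
    if (c >= '0' && c <= '9') d = static_cast<uint32_t>(c - '0');
    else if (c >= 'a' && c <= 'f') d = static_cast<uint32_t>(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') d = static_cast<uint32_t>(c - 'A' + 10);
    else return false;
    v = (v << 4) | d;
  }
  out = v;
  return true;
}

// Hypothetical sketch: decode the "\uXXXX" escape (or surrogate pair) at the
// start of `s`. Returns false on failure. With allow_replacement == true
// (an assumed meaning of the flag), an unpaired surrogate yields U+FFFD
// instead of failing.
bool decode_unicode_escape(const std::string &s, bool allow_replacement,
                           uint32_t &code_point) {
  const uint32_t kReplacement = 0xFFFD;
  uint32_t first = 0;
  if (s.size() < 6 || s[0] != '\\' || s[1] != 'u' || !parse_hex4(s, 2, first)) {
    return false;  // not a well-formed \uXXXX escape at all
  }
  if (first >= 0xD800 && first <= 0xDBFF) {
    // High surrogate: valid only when a low surrogate escape follows.
    uint32_t second = 0;
    if (s.size() >= 12 && s[6] == '\\' && s[7] == 'u' &&
        parse_hex4(s, 8, second) && second >= 0xDC00 && second <= 0xDFFF) {
      code_point = 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
      return true;
    }
  } else if (first < 0xDC00 || first > 0xDFFF) {
    code_point = first;  // ordinary code point outside the surrogate range
    return true;
  }
  // Unpaired surrogate: either a lone low surrogate or a high surrogate
  // without a valid partner.
  if (!allow_replacement) return false;
  code_point = kReplacement;
  return true;
}
```

With `allow_replacement` set to `false`, the escapes `\udee4` and `\udff0` are rejected because both fall in the low-surrogate range 0xDC00 to 0xDFFF and have no preceding high surrogate. With `true`, the sketch yields U+FFFD for each of them.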
Which simdjson API do you use, ondemand or dom?
@lgbo-ustc, we are using the ondemand API, which may validate only the part of the JSON read before the given JSON path is found, not the whole JSON string.
What we encountered is a case like the one below:
Backend
CH (ClickHouse)
Bug description
[Expected behavior] and [actual behavior].
simdjson fails to parse the following JSON
\udee4 and \udff0 are invalid Unicode escapes.
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response