Change some panics to errors in parquet decoder #8602
Conversation
Force-pushed from 17d5287 to 4d193b3
I guess we can lump this in with #7806
I'm happy to split this up into some separate PRs. I know it's a lot of random things as-is.
The pedant in me wants to take you up on your offer, but there's not so much going on here that I think that's necessary. Maybe just change the title to something that sounds better 😉. ("Address panics found in external testing")
Thanks @rambleraptor, these all look sensible to me.
Would it be possible to gin up some tests for at least some of them?
Thanks for digging into this! Several comments.
            return Ok((end, buf.slice(i32_size..end)));
        }
    }
    Err(general_err!("not enough data to read levels"))
This is definitely an improvement over the existing code, but it opens a question:
Given that we're reading bytes from a byte buffer, it seems like we must expect to hit this situation at least occasionally? And the correct response is to fetch more bytes, not fail? Is there some mechanism for handling that higher up in the call stack? Or is there some reason it should be impossible for this code to run off the end of the buffer?
Also -- it seems like `read_num_bytes` should do bounds checking internally and return `Option<T>`, so buffer overrun is obvious at the call site instead of a hidden panic footgun? The method has a half dozen other callers, and they all need to do manual bounds checking, in various ways and with varying degrees of safety. In particular, `parquet/src/data_type.rs` has two call sites that lack any visible bounds checks.
In this particular instance we're reading a buffer that should contain an entire page of data. If it doesn't, that likely points to a problem with the metadata.
Changes to `read_num_bytes` would likely need more careful consideration, as I suspect it might be used in some performance critical sections.
I think changing `read_num_bytes` to return `Option` would be a good idea, as that would essentially replace the `assert` in that method. That `assert` is currently redundant if the caller already checks the bounds, so those two checks would be replaced by one check against the option. In the `BitReader` that might actually improve performance.
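A minimal sketch of the idea being discussed, with an illustrative signature that is not the actual `parquet` crate internals (the real `read_num_bytes` is generic over the output type):

```rust
// Hypothetical bounds-checked variant of read_num_bytes: instead of
// asserting on the buffer length (a hidden panic), return Option so the
// caller must handle a short buffer explicitly.
fn read_num_bytes(size: usize, src: &[u8]) -> Option<u64> {
    // A single up-front check replaces both the internal assert and the
    // caller-side manual bounds check.
    if size > src.len() || size > 8 {
        return None;
    }
    let mut out = [0u8; 8];
    out[..size].copy_from_slice(&src[..size]);
    Some(u64::from_le_bytes(out))
}

fn main() {
    let buf = [0x01u8, 0x00, 0x00, 0x00];
    // In-bounds read succeeds.
    assert_eq!(read_num_bytes(4, &buf), Some(1));
    // An out-of-bounds read is now a recoverable None, not a panic.
    assert_eq!(read_num_bytes(8, &buf), None);
}
```

A caller can then turn `None` into a `ParquetError` with `ok_or_else`, keeping the error path visible at the call site.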
    } else if !is_root_node {
        return Err(general_err!("Repetition level must be defined for non-root types"));
    }
    Ok((next_index, Arc::new(builder.build().unwrap())))
How do we know the `unwrap` is safe?
`build` never returns an `Err` 😉. But good point, could replace `unwrap` with `?`.
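To illustrate the suggested change, here is a small sketch using hypothetical stand-in types (not the actual parquet schema builder API):

```rust
// Hypothetical error and builder types standing in for the parquet ones.
#[derive(Debug)]
struct ParquetError(String);

struct TypeBuilder {
    name: Option<String>,
}

impl TypeBuilder {
    // Like the real builder, this returns a Result even if Err is
    // currently unreachable in practice.
    fn build(self) -> Result<String, ParquetError> {
        self.name
            .ok_or_else(|| ParquetError("missing type name".into()))
    }
}

// With `?`, any future Err from build() propagates to the caller
// instead of panicking at an unwrap().
fn finish(builder: TypeBuilder) -> Result<String, ParquetError> {
    Ok(builder.build()?)
}

fn main() {
    assert_eq!(
        finish(TypeBuilder { name: Some("root".into()) }).unwrap(),
        "root"
    );
    assert!(finish(TypeBuilder { name: None }).is_err());
}
```

The behavioral difference only shows up if `build` ever gains a failure path; until then `?` and `unwrap` are equivalent, but `?` is future-proof.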
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
I took the liberty of merging up from main and fixing
Force-pushed from 5176ace to cdd9f0a
    let mut page_header = PageHeader::default();
    page_header.r#type = PageType::DATA_PAGE_V2;
    page_header.uncompressed_page_size = 10;
    page_header.compressed_page_size = 10;
    let mut data_page_header_v2 = DataPageHeaderV2::default();
    data_page_header_v2.definition_levels_byte_length = 11; // offset > uncompressed_page_size
    page_header.data_page_header_v2 = Some(data_page_header_v2);

    let buffer = Bytes::new();
    let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err();
    assert_eq!(err.to_string(), "Parquet error: Invalid page header");
We don't implement `Default` for the thrift structs nor `Debug` for `Page`. Instead you'll have to explicitly instantiate the page header. Something like
let page_header = PageHeader {
r#type: PageType::DATA_PAGE_V2,
uncompressed_page_size: 10,
compressed_page_size: 10,
data_page_header: None,
index_page_header: None,
dictionary_page_header: None,
crc: None,
data_page_header_v2: Some(DataPageHeaderV2 {
num_nulls: 0,
num_rows: 0,
num_values: 0,
encoding: Encoding::PLAIN,
definition_levels_byte_length: 11,
repetition_levels_byte_length: 0,
is_compressed: None,
statistics: None,
}),
};
let buffer = Bytes::new();
match decode_page(page_header, buffer, Type::INT32, None) {
Err(e) => assert_eq!(e.to_string(), "Parquet error: Invalid page header"),
_ => panic!("should have failed"),
}
Thanks @rambleraptor. Your force push seems to have done away with the merge from head. Could you please merge in origin/main after fixing the failing tests?
🤖: Benchmark completed
Rationale for this change
Our internal testing surfaced some unexpected panics in the parquet decoder. This PR adds error checks for all of them so that they don't affect other users.
What changes are included in this PR?
Various error checks to ensure panics don't occur.
Are these changes tested?
Tests should continue to pass.
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Existing tests should cover these changes.
Are there any user-facing changes?
None.