Change some panics to errors in parquet decoder #8602
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -392,6 +392,9 @@ pub(crate) fn decode_page( | |
let buffer = match decompressor { | ||
Some(decompressor) if can_decompress => { | ||
let uncompressed_page_size = usize::try_from(page_header.uncompressed_page_size)?; | ||
+ if offset > buffer.len() || offset > uncompressed_page_size { | ||
+ return Err(general_err!("Invalid page header")); | ||
+ } | ||
let decompressed_size = uncompressed_page_size - offset; | ||
let mut decompressed = Vec::with_capacity(uncompressed_page_size); | ||
decompressed.extend_from_slice(&buffer.as_ref()[..offset]); | ||
|
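The check added in this hunk guards two operations that would otherwise panic on a corrupt header: the subtraction `uncompressed_page_size - offset` and the slice `buffer[..offset]`. The following is a minimal standalone sketch of the same idea, not the crate's code. The helper name `split_levels` and the plain `String` error are assumptions made for illustration; the PR itself uses `general_err!`.

```rust
/// Hypothetical helper (not part of the parquet crate): validates `offset`
/// before it is used for subtraction and slicing.
fn split_levels(
    buffer: &[u8],
    offset: usize,
    uncompressed_page_size: usize,
) -> Result<(&[u8], usize), String> {
    // `checked_sub` yields None when offset > uncompressed_page_size,
    // where a plain `-` would panic in debug builds or wrap in release builds.
    let decompressed_size = uncompressed_page_size
        .checked_sub(offset)
        .ok_or_else(|| "Invalid page header".to_string())?;
    // `get(..offset)` yields None when offset > buffer.len(),
    // where `&buffer[..offset]` would panic.
    let levels = buffer
        .get(..offset)
        .ok_or_else(|| "Invalid page header".to_string())?;
    Ok((levels, decompressed_size))
}
```

With an empty buffer and an offset of 11 against a page size of 10 (the values used in the new test), this sketch returns an error instead of panicking.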
@@ -458,7 +461,7 @@ pub(crate) fn decode_page( | |
} | ||
_ => { | ||
// For unknown page type (e.g., INDEX_PAGE), skip and read next. | ||
- unimplemented!("Page type {:?} is not supported", page_header.r#type) | ||
+ return Err(general_err!("Page type {:?} is not supported", page_header.r#type)); | ||
} | ||
}; | ||
|
||
|
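The hunk above replaces a panic with a recoverable error for page types the decoder does not handle. A small sketch of that pattern follows, using a stand-in enum `PageKind` and function `decode_kind` that are hypothetical and only mirror the shape of the change:

```rust
/// Stand-in for the real page type enum; the variants here are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageKind {
    Data,
    Dictionary,
    Index,
}

/// Hypothetical decoder entry point: unsupported kinds become an `Err`
/// that the caller can handle, instead of aborting via `unimplemented!`.
fn decode_kind(kind: PageKind) -> Result<&'static str, String> {
    match kind {
        PageKind::Data => Ok("data page"),
        PageKind::Dictionary => Ok("dictionary page"),
        other => Err(format!("Page type {:?} is not supported", other)),
    }
}
```

A caller reading untrusted files can then decide whether to skip the page, report the file as unsupported, or propagate the error with `?`.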
@@ -1139,6 +1142,35 @@ mod tests { | |
|
||
use super::*; | ||
|
||
+ #[test] | ||
+ fn test_decode_page_invalid_offset() { | ||
+ use crate::file::metadata::thrift_gen::DataPageHeaderV2; | ||
|
||
+ let mut page_header = PageHeader::default(); | ||
+ page_header.r#type = PageType::DATA_PAGE_V2; | ||
+ page_header.uncompressed_page_size = 10; | ||
+ page_header.compressed_page_size = 10; | ||
+ let mut data_page_header_v2 = DataPageHeaderV2::default(); | ||
+ data_page_header_v2.definition_levels_byte_length = 11; // offset > uncompressed_page_size | ||
+ page_header.data_page_header_v2 = Some(data_page_header_v2); | ||
|
||
+ let buffer = Bytes::new(); | ||
+ let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err(); | ||
+ assert_eq!(err.to_string(), "Parquet error: Invalid page header"); | ||
Comment on lines +1149 to +1159
We don't implement
let page_header = PageHeader {
r#type: PageType::DATA_PAGE_V2,
uncompressed_page_size: 10,
compressed_page_size: 10,
data_page_header: None,
index_page_header: None,
dictionary_page_header: None,
crc: None,
data_page_header_v2: Some(DataPageHeaderV2 {
num_nulls: 0,
num_rows: 0,
num_values: 0,
encoding: Encoding::PLAIN,
definition_levels_byte_length: 11,
repetition_levels_byte_length: 0,
is_compressed: None,
statistics: None,
}),
};
let buffer = Bytes::new();
match decode_page(page_header, buffer, Type::INT32, None) {
Err(e) => assert_eq!(e.to_string(), "Parquet error: Invalid page header"),
_ => panic!("should have failed"),
}
+ } | ||
|
||
+ #[test] | ||
+ fn test_decode_unsupported_page() { | ||
+ let mut page_header = PageHeader::default(); | ||
+ page_header.r#type = PageType::INDEX_PAGE; | ||
+ let buffer = Bytes::new(); | ||
+ let err = decode_page(page_header, buffer, Type::INT32, None).unwrap_err(); | ||
+ assert_eq!( | ||
+ err.to_string(), | ||
+ "Parquet error: Page type INDEX_PAGE is not supported" | ||
+ ); | ||
+ } | ||
|
||
#[test] | ||
fn test_cursor_and_file_has_the_same_behaviour() { | ||
let mut buf: Vec<u8> = Vec::new(); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1348,17 +1348,19 @@ fn schema_from_array_helper<'a>( | |
.with_logical_type(logical_type) | ||
.with_fields(fields) | ||
.with_id(field_id); | ||
- if let Some(rep) = repetition { | ||
- // Sometimes parquet-cpp and parquet-mr set repetition level REQUIRED or | ||
- // REPEATED for root node. | ||
- // | ||
- // We only set repetition for group types that are not top-level message | ||
- // type. According to parquet-format: | ||
- // Root of the schema does not have a repetition_type. | ||
- // All other types must have one. | ||
- if !is_root_node { | ||
- builder = builder.with_repetition(rep); | ||
- } | ||
|
||
+ // Sometimes parquet-cpp and parquet-mr set repetition level REQUIRED or | ||
+ // REPEATED for root node. | ||
+ // | ||
+ // We only set repetition for group types that are not top-level message | ||
+ // type. According to parquet-format: | ||
+ // Root of the schema does not have a repetition_type. | ||
+ // All other types must have one. | ||
+ if !is_root_node { | ||
+ let Some(rep) = repetition else { | ||
+ return Err(general_err!("Repetition level must be defined for non-root types")); | ||
+ }; | ||
+ builder = builder.with_repetition(rep); | ||
+ } | ||
Ok((next_index, Arc::new(builder.build().unwrap()))) | ||
How do we know the
|
||
} | ||
|
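The schema change above turns a silently skipped case (a non-root node with no repetition) into an explicit error by using `let ... else`. A self-contained sketch of that control flow is shown below. The `Repetition` enum and `resolve_repetition` function are simplified stand-ins introduced here for illustration, not the crate's types.

```rust
/// Simplified stand-in for the crate's repetition enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical helper: returns the repetition to apply to a node, if any.
fn resolve_repetition(
    repetition: Option<Repetition>,
    is_root_node: bool,
) -> Result<Option<Repetition>, String> {
    if is_root_node {
        // The root of the schema has no repetition type, so any value a
        // writer may have set for it is ignored.
        return Ok(None);
    }
    // Every non-root node must carry a repetition; a missing one is an error.
    let Some(rep) = repetition else {
        return Err("Repetition level must be defined for non-root types".to_string());
    };
    Ok(Some(rep))
}
```

Note that `let ... else` requires Rust 1.65 or later; on older toolchains the same logic can be written with `match` or `ok_or_else`.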
This is definitely an improvement over the existing code, but it opens a question:
Given that we're reading bytes from a byte buffer, it seems like we must expect to hit this situation at least occasionally? And the correct response is to fetch more bytes, not fail? Is there some mechanism for handling that higher up in the call stack? Or is there some reason it should be impossible for this code to run off the end of the buffer?

Also, it seems like `read_num_bytes` should do bounds checking internally and return `Option<T>`, so buffer overrun is obvious at the call site instead of a hidden panic footgun? The method has a half dozen other callers, and they all need to do manual bounds checking, in various ways and with varying degrees of safety. In particular, parquet/src/data_type.rs has two call sites that lack any visible bounds checks.
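
To make the suggestion concrete, here is a rough sketch of what a bounds-checked read returning `Option` could look like. It is not the crate's `read_num_bytes`: the names `read_u32_le_checked` and `read_length_prefix`, the fixed `u32` target type, and the little-endian byte order are all assumptions of this example.

```rust
/// Hypothetical bounds-checked read: interprets the first `num_bytes` bytes
/// of `src` as a little-endian u32, or returns None if that is not possible.
fn read_u32_le_checked(src: &[u8], num_bytes: usize) -> Option<u32> {
    // More bytes than the target type can hold is a caller error.
    if num_bytes > 4 {
        return None;
    }
    // `get` performs the single bounds check; no assert and no panic.
    let bytes = src.get(..num_bytes)?;
    let mut raw = [0u8; 4];
    raw[..num_bytes].copy_from_slice(bytes);
    Some(u32::from_le_bytes(raw))
}

/// Hypothetical call site: a short buffer is visible as a None that must be
/// handled, here by converting it into an error.
fn read_length_prefix(src: &[u8]) -> Result<u32, String> {
    read_u32_le_checked(src, 4).ok_or_else(|| "not enough bytes to read length".to_string())
}
```

In this shape the caller cannot forget the check, because the `Option` has to be unwrapped or converted before the value can be used.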

In this particular instance we're reading a buffer that should contain an entire page of data. If it doesn't, that likely points to a problem with the metadata.

Changes to `read_num_bytes` would likely need more careful consideration, as I suspect it might be used in some performance-critical sections.

I think changing `read_num_bytes` to return `Option` would be a good idea, as that would essentially replace the `assert` in that method. That `assert` is currently redundant if the caller already checks the bounds, so those two checks would be replaced by one against the option. In the `BitReader` that might actually improve performance.