-
Notifications
You must be signed in to change notification settings - Fork 425
refactor: remove Vectors from RecordBatch completely
#7184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the internal data representation in RecordBatch to use Arrow arrays directly instead of maintaining a parallel VectorRef collection. The changes eliminate redundant data storage and simplify the codebase by relying on Arrow's native array types.
- Removed the
columns: Vec<VectorRef>field fromRecordBatchand now rely solely on the underlyingDfRecordBatch - Replaced
try_from_df_record_batch(which returnedResult) withfrom_df_record_batch(infallible) - Introduced helper functions (
timestamp_array_value,time_array_value,duration_array_value) to extract typed values from Arrow arrays
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/common/recordbatch/src/recordbatch.rs | Core changes to remove columns field and expose Arrow arrays directly; added iter_column_as_string helper |
| src/datatypes/src/arrow_array.rs | Added helper functions for extracting timestamp, time, and duration values from Arrow arrays |
| src/table/src/table/numbers.rs | Updated to use infallible from_df_record_batch |
| src/table/src/table/scan.rs | Simplified memory size calculation using array_memory_size() |
| src/servers/src/http/result.rs | Introduced HttpOutputWriter to consolidate HTTP output formatting logic |
| src/servers/src/http/result/influxdb_result_v1.rs | Refactored to use HttpOutputWriter, removing 250+ lines of duplicated code |
| src/servers/src/prom_store.rs | Updated to use Arrow array methods and new helper functions |
| src/servers/src/postgres/types.rs | Simplified by using helper functions instead of manual time unit matching |
| src/servers/src/mysql/writer.rs | Simplified by using helper functions instead of manual time unit matching |
| tests-integration/tests/region_migration.rs | Updated test code to use Arrow's native as_primitive API |
| tests-integration/src/tests/instance_test.rs | Updated test assertions to compare Arrow arrays directly |
| src/flow/src/expr.rs | Changed From to TryFrom implementation for Batch |
| src/client/src/region.rs | Wrapped from_df_record_batch result in Ok() for infallible conversion |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let mut vectors = HashMap::with_capacity(self.num_columns()); | ||
|
|
||
| // column schemas in recordbatch must match its vectors, otherwise it's corrupted | ||
| for (vector_schema, vector) in self.schema.column_schemas().iter().zip(self.columns.iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we can remove this column_vectors, it's not in-use any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's still used in "delete" and "insert select".
| let gt_schema = GtSchema::try_from(self.input.schema()).unwrap(); | ||
| let batch = GtRecordBatch::try_from_df_record_batch(Arc::new(gt_schema), batch).unwrap(); | ||
| let vectors = Helper::try_into_vectors(batch.columns()) | ||
| .map_err(|e| DataFusionError::Execution(e.to_string()))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can refactor this to switch to arrow array directly in future, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought so, it requires more understanding of the underlying histogram fold implementation. I'll leave a todo first.
068bd84 to
21112fe
Compare
21112fe to
6e67c7d
Compare
86bfa30 to
f39c679
Compare
Signed-off-by: luofucong <[email protected]>
f39c679 to
c73499e
Compare
evenyag
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
following #7163, now that
Vectors are not needed, they can be removed from ourRecordBatch, to save memoryPR Checklist
Please convert it to a draft if some of the following conditions are not met.