Skip to content

Conversation

@MichaelScofield
Copy link
Collaborator

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

following #7163, now that Vectors are not needed, they can be removed from our RecordBatch, to save memory

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions bot added size/L docs-not-required This change does not impact docs. labels Nov 5, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the internal data representation in RecordBatch to use Arrow arrays directly instead of maintaining a parallel VectorRef collection. The changes eliminate redundant data storage and simplify the codebase by relying on Arrow's native array types.

  • Removed the columns: Vec<VectorRef> field from RecordBatch and now rely solely on the underlying DfRecordBatch
  • Replaced try_from_df_record_batch (which returned Result) with from_df_record_batch (infallible)
  • Introduced helper functions (timestamp_array_value, time_array_value, duration_array_value) to extract typed values from Arrow arrays

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/common/recordbatch/src/recordbatch.rs Core changes to remove columns field and expose Arrow arrays directly; added iter_column_as_string helper
src/datatypes/src/arrow_array.rs Added helper functions for extracting timestamp, time, and duration values from Arrow arrays
src/table/src/table/numbers.rs Updated to use infallible from_df_record_batch
src/table/src/table/scan.rs Simplified memory size calculation using array_memory_size()
src/servers/src/http/result.rs Introduced HttpOutputWriter to consolidate HTTP output formatting logic
src/servers/src/http/result/influxdb_result_v1.rs Refactored to use HttpOutputWriter, removing 250+ lines of duplicated code
src/servers/src/prom_store.rs Updated to use Arrow array methods and new helper functions
src/servers/src/postgres/types.rs Simplified by using helper functions instead of manual time unit matching
src/servers/src/mysql/writer.rs Simplified by using helper functions instead of manual time unit matching
tests-integration/tests/region_migration.rs Updated test code to use Arrow's native as_primitive API
tests-integration/src/tests/instance_test.rs Updated test assertions to compare Arrow arrays directly
src/flow/src/expr.rs Changed From to TryFrom implementation for Batch
src/client/src/region.rs Wrapped from_df_record_batch result in Ok() for infallible conversion

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let mut vectors = HashMap::with_capacity(self.num_columns());

// column schemas in recordbatch must match its vectors, otherwise it's corrupted
for (vector_schema, vector) in self.schema.column_schemas().iter().zip(self.columns.iter())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can remove this column_vectors, it's not in-use any more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's still used in "delete" and "insert select".

let gt_schema = GtSchema::try_from(self.input.schema()).unwrap();
let batch = GtRecordBatch::try_from_df_record_batch(Arc::new(gt_schema), batch).unwrap();
let vectors = Helper::try_into_vectors(batch.columns())
.map_err(|e| DataFusionError::Execution(e.to_string()))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can refactor this to switch to arrow array directly in future, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought so, it requires more understanding of the underlying histogram fold implementation. I'll leave a todo first.

@MichaelScofield MichaelScofield marked this pull request as draft November 10, 2025 01:44
@MichaelScofield MichaelScofield force-pushed the refactor/remove-vector-from-recordbatch branch from 068bd84 to 21112fe Compare November 13, 2025 02:17
@MichaelScofield MichaelScofield marked this pull request as ready for review November 13, 2025 02:17
@MichaelScofield
Copy link
Collaborator Author

@sunng87 @evenyag PTAL

@MichaelScofield MichaelScofield force-pushed the refactor/remove-vector-from-recordbatch branch from 21112fe to 6e67c7d Compare November 13, 2025 11:20
@MichaelScofield MichaelScofield marked this pull request as draft November 14, 2025 11:30
@MichaelScofield MichaelScofield force-pushed the refactor/remove-vector-from-recordbatch branch 4 times, most recently from 86bfa30 to f39c679 Compare November 18, 2025 08:56
@MichaelScofield MichaelScofield force-pushed the refactor/remove-vector-from-recordbatch branch from f39c679 to c73499e Compare November 18, 2025 10:55
@MichaelScofield MichaelScofield marked this pull request as ready for review November 18, 2025 11:26
Copy link
Contributor

@evenyag evenyag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants