Fix StringView buffer bloat in stream_next FFI export #21753

Merged
mch2 merged 6 commits into opensearch-project:main from bowenlan-amzn:fix/stringview-gc-pr on May 21, 2026

@bowenlan-amzn
Member

@bowenlan-amzn bowenlan-amzn commented May 20, 2026

Description

StringViewArray::slice() shares ALL backing buffers via Arc. When DataFusion's hash aggregate emits via EmitTo::All and slices the result into per-batch chunks, each batch carries the full backing buffer pool — not just the strings it references. This caused up to 435x data amplification on multi-shard aggregate queries with many distinct string values.
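The sharing behaviour described above can be sketched with a std-only toy model. `ToyViewArray` and all of its methods are hypothetical names for illustration, not the arrow-rs API: the real `StringViewArray` uses 16-byte views, inlines strings of 12 bytes or fewer, and may hold several data buffers. The sketch only shows why a zero-copy slice keeps the whole buffer alive and why a gc-style copy shrinks it.

```rust
use std::sync::Arc;

/// Toy stand-in for a view array (hypothetical, NOT arrow-rs):
/// every view is an (offset, len) pair into one shared byte buffer.
#[derive(Clone)]
pub struct ToyViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ToyViewArray {
    /// Build an array by appending every value to a single backing buffer.
    pub fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Zero-copy slice: the view list shrinks, the buffer is shared via Arc.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        Self {
            buffer: Arc::clone(&self.buffer),
            views: self.views[offset..offset + len].to_vec(),
        }
    }

    /// Compaction: copy only the referenced bytes into a fresh buffer.
    pub fn gc(&self) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(off, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[off..off + len]);
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Size of the backing buffer this array keeps alive.
    pub fn bytes_allocated(&self) -> usize {
        self.buffer.len()
    }

    /// Bytes actually referenced by the views.
    pub fn bytes_used(&self) -> usize {
        self.views.iter().map(|&(_, len)| len).sum()
    }

    /// Read value `i` back out of the buffer.
    pub fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.buffer[off..off + len]).unwrap()
    }
}
```

In this model, a 5-row slice of a 1000-row array reports the same `bytes_allocated()` as the full array, while `gc()` brings it down to exactly `bytes_used()`.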

Root cause: Arrow IPC and C Data Interface intentionally do not compact StringView backing buffers during serialization (arrow-rs #5513). Compaction is the caller's responsibility at the serialization boundary.

Fix: Add compact_string_view_columns() in stream_next() before FFI export. Calls gc() on StringView/BinaryView columns only when backing buffers are significantly over-allocated. Three-level precondition avoids unnecessary work:

stream_next() per batch:
  1. has_views flag?         → No: skip (O(1), set once at stream creation)
  2. view_needs_gc()?        → No: skip, buffers already compact (O(n) scan, 2x + 10KB threshold)
  3. gc()                    → compact the bloated columns (O(n) copy)
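The three checks above can be sketched as a small decision function. This is a std-only illustration, not the code in api.rs: `Decision` and `decide` are hypothetical names, the byte counts are passed in as plain integers instead of being read from Arrow buffers, and `GC_MIN_WASTE_BYTES` is assumed to be 10 KiB based on the "2x + 10KB" wording.

```rust
/// Assumed value, taken from the "2x + 10KB threshold" wording in the description.
const GC_MIN_WASTE_BYTES: usize = 10 * 1024;

/// Outcome of the per-batch check (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Decision {
    /// Level 1: the stream has no view columns at all.
    SkipNoViews,
    /// Level 2: view columns exist but their buffers are not bloated.
    SkipAlreadyCompact,
    /// Level 3: buffers are bloated, so compaction is worth the copy.
    Gc,
}

/// Bloat test: more than 2x over-allocated AND more than 10 KiB wasted.
pub fn needs_gc(bytes_allocated: usize, bytes_used: usize) -> bool {
    let waste = bytes_allocated.saturating_sub(bytes_used);
    let is_significantly_bloated = bytes_allocated > bytes_used.saturating_mul(2);
    is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
}

/// Apply the checks in order, cheapest first.
pub fn decide(has_views: bool, bytes_allocated: usize, bytes_used: usize) -> Decision {
    if !has_views {
        return Decision::SkipNoViews;
    }
    if !needs_gc(bytes_allocated, bytes_used) {
        return Decision::SkipAlreadyCompact;
    }
    Decision::Gc
}
```

Requiring both conditions means a small batch with a high ratio but little absolute waste (for example 8,000 bytes allocated for 1,000 used) is left alone, which keeps the copy cost for cases where it pays off.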

Performance

| Query type | Cost per batch |
| --- | --- |
| No StringView columns (numeric agg, scan) | Zero (O(1) bool check) |
| StringView but not sliced (scan, filter) | O(n) view scan, no allocation |
| Sliced StringView (hash aggregate emit) | O(n) scan + copy into compact buffer |

Validated on 4-shard ClickBench q19 (14M groups, SearchPhrase):

  • Before: 9748 batches × 174MB = 1.7TB transferred, 12 minutes
  • After: 9748 batches × 0.4MB = 4GB transferred, 11 seconds
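As a quick sanity check on the figures above, the totals and ratios can be recomputed from the per-batch sizes. The helper names below are hypothetical, and decimal units (1 TB = 1,000,000 MB) are assumed.

```rust
/// Total transfer in MB for `batches` batches of `mb_per_batch` each.
pub fn total_mb(batches: u64, mb_per_batch: f64) -> f64 {
    batches as f64 * mb_per_batch
}

/// Ratio between two measurements of the same quantity.
pub fn ratio(before: f64, after: f64) -> f64 {
    before / after
}

/// Convert MB to TB, assuming decimal units (1 TB = 1_000_000 MB).
pub fn mb_to_tb(mb: f64) -> f64 {
    mb / 1_000_000.0
}
```

With the reported inputs, 9748 batches at 174 MB come to about 1.7 TB, 9748 batches at 0.4 MB come to about 3.9 GB, the per-batch ratio is 435x, and 12 minutes versus 11 seconds is roughly a 65x speedup.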

Testing

# Unit tests (function correctness + optimization)
cargo test -p opensearch-datafusion --lib -- api::tests

# Integration test (regression guard for stream_next call site)
cargo test -p opensearch-datafusion --test stringview_gc_test

The integration test feeds a sliced 10K→100 StringView batch through df_stream_next and asserts output buffers are compact (<10KB). Fails immediately if the gc() call is removed (gets 310KB instead).

Check List

  • New functionality includes testing
  • New functionality has been documented
  • Commits are signed off

@github-actions
Contributor

github-actions Bot commented May 20, 2026

PR Reviewer Guide 🔍

(Review updated until commit fbd7a83)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Panic on downcast

The expect() calls in compact_string_view_columns() will panic if the schema declares Utf8View/BinaryView but the actual column is a different type (e.g., due to a corrupted batch or schema mismatch). This can crash the query stream. Consider returning a DataFusionError instead of panicking, or at minimum add a debug assertion that validates schema consistency before processing.

            let view: &arrow_array::StringViewArray = col.as_any().downcast_ref()
                .expect("column must be StringViewArray when schema declares Utf8View");
            view_needs_gc(view.data_buffers(), view.total_buffer_bytes_used())
        }
        DataType::BinaryView => {
            let view: &arrow_array::BinaryViewArray = col.as_any().downcast_ref()
                .expect("column must be BinaryViewArray when schema declares BinaryView");
            view_needs_gc(view.data_buffers(), view.total_buffer_bytes_used())
        }
        _ => false,
    });
if !needs_compaction {
    return batch;
}
let columns: Vec<Arc<dyn Array>> = batch
    .columns()
    .iter()
    .zip(schema.fields().iter())
    .map(|(col, field)| match field.data_type() {
        DataType::Utf8View => {
            let view: &arrow_array::StringViewArray = col.as_any().downcast_ref()
                .expect("column must be StringViewArray when schema declares Utf8View");
            Arc::new(view.gc()) as Arc<dyn Array>
        }
        DataType::BinaryView => {
            let view: &arrow_array::BinaryViewArray = col.as_any().downcast_ref()
                .expect("column must be BinaryViewArray when schema declares BinaryView");
            Arc::new(view.gc()) as Arc<dyn Array>

Unchecked unwrap

Line 525 uses expect() when constructing the compacted RecordBatch. If the gc'd columns somehow violate the schema (e.g., wrong length after gc, though unlikely), this will panic and crash the stream. Since this is in the FFI export path, a panic here terminates the query without a recoverable error. Consider propagating the error as DataFusionError instead.

RecordBatch::try_new(schema, columns).expect("gc'd columns must match schema")

@github-actions
Contributor

github-actions Bot commented May 20, 2026

PR Code Suggestions ✨

Latest suggestions up to fbd7a83

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Handle zero bytes edge case

The function may incorrectly trigger GC when bytes_used is zero, as bytes_allocated
> 2 * 0 is always true for non-empty buffers. Add an early return to skip GC when
bytes_used is zero or when bytes_allocated is zero to avoid division-by-zero-like
edge cases.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [532-537]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
+    if bytes_used == 0 {
+        return false;
+    }
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
+    if bytes_allocated == 0 {
+        return false;
+    }
     let waste = bytes_allocated.saturating_sub(bytes_used);
     let is_significantly_bloated = bytes_allocated > 2 * bytes_used;
     is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies an edge case where bytes_used is zero, which would make bytes_allocated > 2 * bytes_used always true for non-empty buffers. Adding early returns for zero values prevents unnecessary GC operations and improves robustness. However, this is a defensive programming improvement rather than a critical bug fix, as the existing tests don't appear to exercise this edge case.

Medium
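To make the zero-bytes_used case concrete, here is a std-only sketch comparing the predicate with and without the suggested guard. Both function names are hypothetical, and `GC_MIN_WASTE_BYTES` is assumed to be 10 KiB; this is an illustration of the edge case, not the merged code.

```rust
/// Assumed value, based on the "2x + 10KB" threshold in the PR description.
const GC_MIN_WASTE_BYTES: usize = 10 * 1024;

/// Predicate without any zero guard (hypothetical name).
pub fn needs_gc_unguarded(bytes_allocated: usize, bytes_used: usize) -> bool {
    let waste = bytes_allocated.saturating_sub(bytes_used);
    bytes_allocated > 2 * bytes_used && waste > GC_MIN_WASTE_BYTES
}

/// Same predicate with the early returns proposed in the suggestion.
pub fn needs_gc_guarded(bytes_allocated: usize, bytes_used: usize) -> bool {
    if bytes_used == 0 || bytes_allocated == 0 {
        return false;
    }
    needs_gc_unguarded(bytes_allocated, bytes_used)
}
```

The two variants only disagree when nothing is referenced and the allocation exceeds the waste threshold: for 100,000 allocated bytes and 0 used, the unguarded form returns true and the guarded form returns false. Below the threshold the waste check already prevents a gc, and no division is involved in either form.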

Previous suggestions

Suggestions up to commit 681d734
Category | Suggestion | Impact
Possible issue
Handle empty buffer edge case

The bloat detection logic has a critical flaw when bytes_allocated is zero (empty
buffers). Division by zero or incorrect comparisons can occur. Add an early return
to handle the empty buffer case before performing arithmetic operations.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [532-537]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
+    if bytes_allocated == 0 {
+        return false;
+    }
     let waste = bytes_allocated.saturating_sub(bytes_used);
     let is_significantly_bloated = bytes_allocated > 2 * bytes_used;
     is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
 }
Suggestion importance[1-10]: 3

Why: While the suggestion correctly identifies a potential edge case with empty buffers, the existing code already handles this safely. When bytes_allocated is 0, saturating_sub returns 0, is_significantly_bloated is false (0 > 0 is false), and the function returns false. No division occurs. The early return adds clarity but doesn't fix a critical flaw.

Low
Suggestions up to commit 68f0a34
Category | Suggestion | Impact
Possible issue
Handle zero bytes_used edge case

The bloat detection logic has a critical flaw when bytes_used is zero. If a
StringView array has empty strings or all inline values, bytes_used will be 0,
causing bytes_allocated > 2 * 0 to always be true even for tiny allocations. This
triggers unnecessary GC operations. Add an early return when bytes_used is zero or
when bytes_allocated is below a minimum threshold.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [532-537]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
+    if bytes_allocated == 0 || bytes_used == 0 {
+        return false;
+    }
     let waste = bytes_allocated.saturating_sub(bytes_used);
     let is_significantly_bloated = bytes_allocated > 2 * bytes_used;
     is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
 }
Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies a critical edge case where bytes_used is zero, which would cause bytes_allocated > 2 * 0 to always be true for any non-zero allocation. This could trigger unnecessary GC operations for arrays with inline strings or empty data, impacting performance.

Medium
General
Check batch schema dynamically

The has_views flag is set once during handle creation based on the stream schema,
but it doesn't account for dynamic schema changes during streaming. If the schema
evolves or differs across batches, view columns might be missed or non-view batches
might be unnecessarily processed. Consider checking the actual batch schema instead
of relying on the cached flag.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [470-474]

-let batch = if handle.has_views {
+let batch = if Self::schema_has_views(&batch.schema()) {
     compact_string_view_columns(batch)
 } else {
     batch
 };
Suggestion importance[1-10]: 3

Why: While the suggestion raises a valid theoretical concern about schema changes, DataFusion streams typically have fixed schemas. The cached has_views flag is an optimization to avoid repeated schema checks. The suggested change would add overhead by checking every batch schema, which is likely unnecessary in practice.

Low
Suggestions up to commit c3cae77
Category | Suggestion | Impact
General
Verify schema consistency across batches

The has_views flag is set once during handle creation but never updated if the
stream schema changes mid-execution. If DataFusion produces batches with different
schemas (e.g., union queries), this could cause view columns to bypass compaction.
Consider checking the batch schema dynamically or documenting the assumption that
schema remains constant.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [470-474]

-let batch = if handle.has_views {
+let batch = if handle.has_views || QueryStreamHandle::schema_has_views(&batch.schema()) {
     compact_string_view_columns(batch)
 } else {
     batch
 };
Suggestion importance[1-10]: 7

Why: The suggestion raises a valid concern about schema changes mid-stream (e.g., union queries). However, checking schema_has_views on every batch adds overhead. The current approach assumes schema stability, which is typical but not guaranteed. This warrants verification or documentation of the assumption.

Medium
Handle empty buffer edge case

The bloat detection logic has a critical flaw when bytes_allocated is zero (empty
buffers). The condition bytes_allocated > 2 * bytes_used evaluates to 0 > 0 (false),
but if bytes_used is also zero, the ratio check becomes meaningless. Add an early
return for the zero-allocation case to avoid unnecessary computation and clarify
intent.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [532-537]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
+    if bytes_allocated == 0 {
+        return false;
+    }
     let waste = bytes_allocated.saturating_sub(bytes_used);
     let is_significantly_bloated = bytes_allocated > 2 * bytes_used;
     is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
 }
Suggestion importance[1-10]: 4

Why: While the suggestion correctly identifies that bytes_allocated == 0 is a valid edge case, the current logic already handles it correctly: when bytes_allocated is 0, bytes_allocated > 2 * bytes_used is false, and the function returns false. The early return adds minor clarity but doesn't fix a bug.

Low
Suggestions up to commit 2be0546
Category | Suggestion | Impact
Possible issue
Handle zero bytes_used edge case

The bloat detection logic has a critical flaw when bytes_used is zero. If a
StringView array has empty strings or all inline values, bytes_used will be 0,
causing bytes_allocated > 2 * 0 to always be true even for tiny buffers. This
triggers unnecessary GC operations. Add an early return when bytes_used is zero or
when bytes_allocated is below a minimum threshold.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [532-537]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
+    if bytes_allocated == 0 || bytes_used == 0 {
+        return false;
+    }
     let waste = bytes_allocated.saturating_sub(bytes_used);
     let is_significantly_bloated = bytes_allocated > 2 * bytes_used;
     is_significantly_bloated && waste > GC_MIN_WASTE_BYTES
 }
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies a potential issue where bytes_used being zero could cause incorrect bloat detection. However, the impact is moderate since the waste > GC_MIN_WASTE_BYTES check provides some protection, and empty/inline-only arrays are edge cases that may not trigger significant problems in practice.

Medium
General
Detect view columns per batch

The has_views flag is set once during handle creation based on the stream schema,
but it doesn't account for dynamic schema changes during execution. If the stream's
schema evolves (e.g., through projections or transformations), the flag may become
stale, causing view columns to be missed or non-view batches to be unnecessarily
processed.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [470-474]

-let batch = if handle.has_views {
+let batch = if handle.has_views || QueryStreamHandle::schema_has_views(&batch.schema()) {
     compact_string_view_columns(batch)
 } else {
     batch
 };
Suggestion importance[1-10]: 3

Why: While theoretically possible, schema changes during stream execution are extremely rare in DataFusion's execution model. The has_views flag optimization avoids redundant schema checks on every batch. The suggestion would add unnecessary overhead for a scenario that is unlikely to occur in practice.

Low
Suggestions up to commit 41069b3
Category | Suggestion | Impact
General
Extract view detection into helper method

The has_views flag is computed twice (in both constructors) with identical logic.
Extract this into a helper method to avoid code duplication and ensure consistency
if the detection logic needs to change.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [85-87]

-let has_views = stream.schema().fields().iter().any(|f| {
-    matches!(f.data_type(), DataType::Utf8View | DataType::BinaryView)
-});
+fn schema_has_view_types(schema: &arrow_schema::SchemaRef) -> bool {
+    schema.fields().iter().any(|f| {
+        matches!(f.data_type(), DataType::Utf8View | DataType::BinaryView)
+    })
+}
 
+// In constructors:
+let has_views = Self::schema_has_view_types(&stream.schema());
+
Suggestion importance[1-10]: 6

Why: The suggestion correctly identifies code duplication in lines 85-87 and 101-103. Extracting this logic into a helper method would improve maintainability and reduce duplication, though the impact is moderate since the duplicated code is simple and unlikely to change frequently.

Low
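The helper extraction discussed above can be sketched without any Arrow dependency. Every type here (`ToyDataType`, `ToyField`, `ToyStreamHandle`) is a hypothetical stand-in for the Arrow and plugin types; only the shape of the check, a `matches!` over the two view types computed once at handle creation, follows the snippet in the suggestion.

```rust
/// Hypothetical stand-in for arrow's DataType, reduced to four variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyDataType {
    Int64,
    Utf8,
    Utf8View,
    BinaryView,
}

/// Hypothetical stand-in for a schema field.
pub struct ToyField {
    pub name: &'static str,
    pub data_type: ToyDataType,
}

/// Single place that decides whether a schema contains view columns.
pub fn schema_has_views(fields: &[ToyField]) -> bool {
    fields
        .iter()
        .any(|f| matches!(f.data_type, ToyDataType::Utf8View | ToyDataType::BinaryView))
}

/// Hypothetical handle that caches the answer at creation time.
pub struct ToyStreamHandle {
    pub has_views: bool,
}

impl ToyStreamHandle {
    /// Every constructor would call the shared helper instead of repeating the closure.
    pub fn new(fields: &[ToyField]) -> Self {
        Self { has_views: schema_has_views(fields) }
    }
}
```

Caching the flag assumes the stream schema does not change between batches, which is the assumption the other suggestions in this thread question.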
Prevent potential integer overflow

The function could panic on integer overflow if bytes_allocated or bytes_used are
extremely large. Use checked arithmetic or saturating operations to prevent
potential overflow issues in production.

sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs [531-534]

 fn view_needs_gc(buffers: &[arrow::buffer::Buffer], bytes_used: usize) -> bool {
     let bytes_allocated: usize = buffers.iter().map(|b| b.len()).sum();
-    bytes_allocated > 2 * bytes_used && (bytes_allocated - bytes_used) > 10_240
+    bytes_allocated.saturating_sub(bytes_used) > 10_240 
+        && bytes_allocated > bytes_used.saturating_mul(2)
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion addresses a theoretical overflow risk in view_needs_gc. While using saturating_sub and saturating_mul is safer, the likelihood of overflow with buffer sizes is extremely low in practice. The improved code also changes the logic slightly (checking conditions in different order), which may affect readability without significant safety benefit.

Low
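For reference, the overflow behaviour behind this suggestion can be checked with standard-library integer methods alone. By default, an unchecked `2 * bytes_used` panics on overflow in debug builds and wraps in release builds, while `saturating_mul` clamps at `usize::MAX`. The function names below are hypothetical; the predicate body mirrors the suggested diff.

```rust
/// Predicate from the suggested diff, using saturating arithmetic throughout.
pub fn needs_gc_saturating(bytes_allocated: usize, bytes_used: usize) -> bool {
    bytes_allocated.saturating_sub(bytes_used) > 10_240
        && bytes_allocated > bytes_used.saturating_mul(2)
}

/// Doubling that reports overflow instead of panicking or wrapping.
pub fn doubled_checked(bytes_used: usize) -> Option<usize> {
    bytes_used.checked_mul(2)
}
```

For `bytes_used = usize::MAX / 2 + 1`, the doubled value does not fit in a `usize`, so `doubled_checked` returns `None` and the saturating predicate evaluates `usize::MAX > usize::MAX`, which is false. That matches the mathematical answer, since no buffer can be more than twice that size.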

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from f679645 to 8ba38ce Compare May 20, 2026 04:45
@github-actions
Contributor

Persistent review updated to latest commit 8ba38ce

@github-actions
Contributor

Persistent review updated to latest commit f877a9a

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from f877a9a to 09147cb Compare May 20, 2026 05:15
@github-actions
Contributor

Persistent review updated to latest commit 09147cb

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from 09147cb to 8ba38ce Compare May 20, 2026 05:46
@github-actions
Contributor

Persistent review updated to latest commit 8ba38ce

@github-actions
Contributor

Persistent review updated to latest commit 373b21a

@github-actions
Contributor

Persistent review updated to latest commit 304654d

@github-actions
Contributor

✅ Gradle check result for 304654d: SUCCESS

@codecov

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.48%. Comparing base (a6b5e43) to head (fbd7a83).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21753      +/-   ##
============================================
+ Coverage     73.43%   73.48%   +0.04%     
- Complexity    75103    75131      +28     
============================================
  Files          6016     6016              
  Lines        341072   341072              
  Branches      49091    49091              
============================================
+ Hits         250469   250621     +152     
+ Misses        70682    70452     -230     
- Partials      19921    19999      +78     

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from 304654d to ef42d8c Compare May 20, 2026 16:09
@bowenlan-amzn bowenlan-amzn marked this pull request as ready for review May 20, 2026 16:09
@bowenlan-amzn bowenlan-amzn requested a review from a team as a code owner May 20, 2026 16:09
@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from d2f9458 to ed1305d Compare May 20, 2026 16:10
@github-actions
Contributor

Persistent review updated to latest commit ed1305d

@github-actions
Contributor

❌ Gradle check result for ed1305d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 46abccb

@bharath-techie
Contributor

Similar fix in DF - apache/datafusion#20381

@github-actions
Contributor

✅ Gradle check result for 46abccb: SUCCESS

@github-actions
Contributor

github-actions Bot commented May 20, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 3678ddb.

Path | Line | Severity | Description
sandbox/libs/dataformat-native/rust/Cargo.toml | 91 | high | Supply chain risk: [patch.crates-io] redirects 24 datafusion crates (datafusion, datafusion-catalog, datafusion-common, datafusion-execution, datafusion-expr, datafusion-optimizer, datafusion-physical-plan, datafusion-substrait, and 16 others) away from crates.io to a personal GitHub fork (bowenlan-amzn/datafusion.git, branch opensearch-53.1.0-patched). Maintainers cannot verify that this fork contains only the stated cherry-picks (#21633, #20381) without auditing every commit on that branch. Namespace control over a personal fork is not equivalent to the official apache/arrow-datafusion repository.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch 3 times, most recently from 3678ddb to a4ec8f2 Compare May 20, 2026 19:40
@github-actions
Contributor

Persistent review updated to latest commit a4ec8f2

Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs
@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from a4ec8f2 to e8b820d Compare May 20, 2026 21:40
@github-actions
Contributor

Persistent review updated to latest commit e8b820d

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from e8b820d to 41069b3 Compare May 20, 2026 21:42
@github-actions
Contributor

Persistent review updated to latest commit 41069b3

Contributor

@expani expani left a comment

Great finding and deep dive @bowenlan-amzn 👏

Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs Outdated
Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs Outdated
Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/api.rs
@github-actions
Contributor

✅ Gradle check result for 41069b3: SUCCESS

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from 41069b3 to 2be0546 Compare May 20, 2026 22:47
@github-actions
Contributor

Persistent review updated to latest commit 2be0546

@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from 2be0546 to c3cae77 Compare May 20, 2026 22:56
@github-actions
Contributor

Persistent review updated to latest commit c3cae77

@github-actions
Contributor

❌ Gradle check result for c3cae77: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 68f0a34

@github-actions
Contributor

❌ Gradle check result for 68f0a34: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

StringViewArray::slice() shares ALL backing buffers via Arc::clone.
When DataFusion's hash aggregate emits output (EmitTo::All + slice into
8192-row batches), each slice carries the full backing buffer pool.
For a 14M-group aggregate with SearchPhrase strings, this means 174MB
per batch instead of 0.4MB — a 435x amplification.

Add compact_string_view_columns() in stream_next() that calls gc() on
Utf8View/BinaryView columns before C Data Interface export. This
compacts each batch to contain only its own referenced strings.

Validated on 4-shard ClickBench q19:
  Before: 9748 batches × 174MB = 1.7TB, 12 minutes
  After:  9748 batches × 0.4MB = 4GB, 11 seconds (65x faster)

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

Regression guard for e2fd9bc (StringView buffer bloat fix). Tests prove
that sliced StringView/BinaryView batches carry inflated backing buffers
and that gc() compacts them to proportional size. Covers: large buffer
compaction, inline-only no-op, empty array safety, BinaryView parity,
and non-view passthrough.

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

compact_string_view_columns now checks whether backing buffers are
over-allocated before calling gc(). Non-sliced batches (common case)
pay only an O(n) view scan instead of a full buffer copy.

New tests:
- view_needs_gc_detects_bloat: proves detection correctly identifies
  sliced arrays vs non-sliced arrays
- non_sliced_batch_skips_gc: proves non-sliced batches pass through
  without allocation/copy

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

- New integration test (stringview_gc_test.rs): feeds a sliced 10K→100
  StringView batch through df_stream_next and asserts the output backing
  buffers are compact (<10KB, not ~300KB). Fails immediately if
  compact_string_view_columns is removed from stream_next — the actual
  regression guard for this fix.

- Fix non_sliced_batch_skips_gc: use Arc::ptr_eq to prove the fast path
  returns the original column without copying, not just that sizes match.

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@bowenlan-amzn bowenlan-amzn force-pushed the fix/stringview-gc-pr branch from 68f0a34 to 681d734 Compare May 20, 2026 23:21
@github-actions
Contributor

Persistent review updated to latest commit 681d734

@github-actions
Contributor

❌ Gradle check result for 681d734: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The makeNodeStatsWithResourceUsage helper was missing the
AnalyticsBackendNativeMemoryStats parameter added in opensearch-project#21637.

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@github-actions
Contributor

Persistent review updated to latest commit fbd7a83

@github-actions
Contributor

✅ Gradle check result for fbd7a83: SUCCESS

@mch2 mch2 merged commit 2696b95 into opensearch-project:main May 21, 2026
15 of 16 checks passed
@bowenlan-amzn bowenlan-amzn deleted the fix/stringview-gc-pr branch May 21, 2026 01:07
4 participants