Skip to content

Fix: compact view buffers in ScalarValue::compact for all container t…#21934

Merged
alamb merged 5 commits into
apache:mainfrom
bert-beyondloops:scalar_value_compact_view_types
May 22, 2026
Merged

Fix: compact view buffers in ScalarValue::compact for all container t…#21934
alamb merged 5 commits into
apache:mainfrom
bert-beyondloops:scalar_value_compact_view_types

Conversation

@bert-beyondloops
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

ScalarValue::compact copies array data for container types to release sliced-buffer overhead, but never called .gc() on nested StringViewArray / BinaryViewArray. A scalar extracted from a large batch would therefore still hold a reference to the entire original view backing buffer, negating the benefit of compaction for any list, struct, or map whose values are view-typed.

What changes are included in this PR?

  • ScalarValue::compact now compacts nested view buffers for all container types (List, LargeList, FixedSizeList, ListView, LargeListView, Struct, Map), trimming StringViewArray / BinaryViewArray backing buffers to only the
    referenced bytes.
  • The internal compact_view_buffers helper recursively handles FixedSizeList, ListView, LargeListView, and Map.

Are these changes tested?

new unit tests are added in scalar::tests, one per container type. Each test verifies that after compact() the backing buffer is reduced to exactly the bytes of the referenced string, and that the scalar value is preserved.

Are there any user-facing changes?

No. The public compact / compacted API is unchanged; this PR only fixes the behaviour for view-typed nested arrays.

@github-actions github-actions Bot added the common Related to common crate label Apr 29, 2026
@bert-beyondloops
Copy link
Copy Markdown
Contributor Author

I tried to come up with a minimal solution.
If the implementation should somehow reside in another place, or made in another way, please feel free to comment.

@alamb alamb added the performance Make DataFusion faster label Apr 29, 2026
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if we should locate this functionality within copy_array_data() itself; I think it's possible to achieve what we need via ArrayData too 🤔

pub fn copy_array_data(src_data: &ArrayData) -> ArrayData {
let mut copy = MutableArrayData::new(vec![&src_data], true, src_data.len());
copy.extend(0, 0, src_data.len());
copy.freeze()
}

Especially as the documentation for copy_array_data() states its meant to compact.

@bert-beyondloops
Copy link
Copy Markdown
Contributor Author

bert-beyondloops commented May 19, 2026

@Jefffrey Hi, you mean something like this approach :

pub fn copy_array_data(arr: &dyn Array) -> ArrayRef {
    let src_data = arr.to_data();
    let mut copy = MutableArrayData::new(vec![&src_data], true, src_data.len());
    copy.extend(0, 0, src_data.len());
    compact_view_buffers(make_array(copy.freeze()))
}

In this way I have to change the signature of the copy_array_data method.
It internally really needs the specific Array implementations in order to call the gc() methods on it.
(I want to reuse the different gc implementations on some array types)

There is only 1 external caller on this method in the min_max aggregates space, so that seems ok to change as well.
(So this call benefits as well from the compaction fix)

Or do you still see another approach?

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is an improvement over what is on main and thus we could merge it in.

I agree with @Jefffrey that as written the code is kind of awkward now (as it dispatches on type twice) but I also think we can clean it up later

I really think the "right" think to do is move this functionality upstream into arrow-rs eventually -- maybe something like add gc to the Array trait 🤔

}
}
ScalarValue::Dictionary(_, value) => {
value.compact();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to recurse into Dictionary, Union, and and REE arrays too? They are all recursive as well


#[test]
fn test_compact_list_utf8view() {
const N: usize = 50;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an awful lot of repetition in these test

        const N: usize = 50;
        let strings = make_long_strings(N);
        let one_len = strings.value(0).len();
        assert!(utf8view_buffer_bytes(&strings) >= N * one_len);

        let single_row_list_array =
            SingleRowListArrayBuilder::new(Arc::new(strings.slice(0, 1)) as ArrayRef)
                .build_large_list_array();

...

        assert_eq!(
            utf8view_buffer_bytes(arr.values().as_string_view()),
            one_len
        );
        assert_eq!(arr.values().as_string_view().value(0), strings.value(0));

is just copy/pasted in each one.

Maybe we could move it into a fixture to reduce duplication and make it clearer what is different between different tests

);
}

// ── compact / compact_view_buffers ───────────────────────────────────────
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that these tests fail when I back out the code changes

Details
diff --git a/datafusion/common/src/scalar/mod.rs b/datafusion/common/src/scalar/mod.rs
index 63cbce10ae..73f4f0b76c 100644
--- a/datafusion/common/src/scalar/mod.rs
+++ b/datafusion/common/src/scalar/mod.rs
@@ -4719,14 +4719,6 @@ impl ScalarValue {
     /// This can be relevant when `self` is a list or contains a list as a nested value, as
     /// a single list holds an Arc to its entire original array buffer.
     pub fn compact(&mut self) {
-        // copy_array_data + compact_view_buffers + downcast back, all in one step.
-        macro_rules! compact_array {
-            ($arr:expr, $from_type:ty, $($as_method:tt)+) => {
-                *Arc::make_mut($arr) = ScalarValue::compact_view_buffers(
-                    Arc::new(<$from_type>::from(copy_array_data(&$arr.to_data()))) as ArrayRef,
-                ).$($as_method)+.clone()
-            };
-        }
         match self {
             ScalarValue::Null
             | ScalarValue::Boolean(_)
@@ -4770,20 +4762,33 @@ impl ScalarValue {
             | ScalarValue::LargeBinary(_)
             | ScalarValue::BinaryView(_) => (),
             ScalarValue::FixedSizeList(arr) => {
-                compact_array!(arr, FixedSizeListArray, as_fixed_size_list())
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = FixedSizeListArray::from(array);
+            }
+            ScalarValue::List(arr) => {
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = ListArray::from(array);
             }
-            ScalarValue::List(arr) => compact_array!(arr, ListArray, as_list::<i32>()),
             ScalarValue::LargeList(arr) => {
-                compact_array!(arr, LargeListArray, as_list::<i64>())
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = LargeListArray::from(array)
             }
             ScalarValue::ListView(arr) => {
-                compact_array!(arr, ListViewArray, as_list_view::<i32>())
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = ListViewArray::from(array);
             }
             ScalarValue::LargeListView(arr) => {
-                compact_array!(arr, LargeListViewArray, as_list_view::<i64>())
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = LargeListViewArray::from(array)
+            }
+            ScalarValue::Struct(arr) => {
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = StructArray::from(array);
+            }
+            ScalarValue::Map(arr) => {
+                let array = copy_array_data(&arr.to_data());
+                *Arc::make_mut(arr) = MapArray::from(array);
             }
-            ScalarValue::Struct(arr) => compact_array!(arr, StructArray, as_struct()),
-            ScalarValue::Map(arr) => compact_array!(arr, MapArray, as_map()),
-              ScalarValue::Union(val, _, _) => {
                 if let Some((_, value)) = val.as_mut() {
                     value.compact();
@@ -4804,95 +4809,6 @@ impl ScalarValue {
         self
     }

-    /// Recursively compacts the backing buffers of any [`StringViewArray`] or
-    /// [`BinaryViewArray`] nested within `array`.
-    ///
-    /// View-typed arrays keep an `Arc` reference to their original backing
-    /// buffers, so a single scalar extracted from a large batch still retains
-    /// the entire buffer.  Calling [`.gc()`][StringViewArray::gc] copies only
-    /// the bytes that are actually referenced by the surviving views, releasing
-    /// the rest.
-    ///
-    /// Container types (`List`, `LargeList`, `FixedSizeList`, `ListView`,
-    /// `LargeListView`, `Struct`, `Map`) are handled by recursing into their
-    /// child / values arrays and reconstructing the parent with the compacted
-    /// children.  All other types are returned unchanged.
-    fn compact_view_buffers(array: ArrayRef) -> ArrayRef {
-        // Macro for the i32/i64-offset list pair (List / LargeList).
-        macro_rules! gc_list {
-            ($field:expr, $offset_type:ty, $array_type:ty) => {{
-                let list = array.as_list::<$offset_type>();
-                Arc::new(<$array_type>::new(
-                    Arc::clone($field),
-                    list.offsets().clone(),
-                    ScalarValue::compact_view_buffers(Arc::clone(list.values())),
-                    list.nulls().cloned(),
-                )) as ArrayRef
-            }};
-        }
-        // Macro for the i32/i64-offset list-view pair (ListView / LargeListView).
-        macro_rules! gc_list_view {
-            ($field:expr, $offset_type:ty, $array_type:ty) => {{
-                let list = array.as_list_view::<$offset_type>();
-                Arc::new(<$array_type>::new(
-                    Arc::clone($field),
-                    list.offsets().clone(),
-                    list.sizes().clone(),
-                    ScalarValue::compact_view_buffers(Arc::clone(list.values())),
-                    list.nulls().cloned(),
-                )) as ArrayRef
-            }};
-        }
-
-        match array.data_type() {
-            DataType::Utf8View => Arc::new(array.as_string_view().gc()),
-            DataType::BinaryView => Arc::new(array.as_binary_view().gc()),
-            DataType::Struct(_) => {
-                let s = array.as_struct();
-                let columns = s
-                    .columns()
-                    .iter()
-                    .map(|c| ScalarValue::compact_view_buffers(Arc::clone(c)))
-                    .collect();
-                Arc::new(StructArray::new(
-                    s.fields().clone(),
-                    columns,
-                    s.nulls().cloned(),
-                ))
-            }
- 
-            DataType::List(field) => gc_list!(field, i32, ListArray),
-            DataType::LargeList(field) => gc_list!(field, i64, LargeListArray),
-            DataType::FixedSizeList(field, size) => {
-                let list = array.as_fixed_size_list();
-                Arc::new(FixedSizeListArray::new(
-                    Arc::clone(field),
-                    *size,
-                    ScalarValue::compact_view_buffers(Arc::clone(list.values())),
-                    list.nulls().cloned(),
-                ))
-            }
-            DataType::ListView(field) => gc_list_view!(field, i32, ListViewArray),
-            DataType::LargeListView(field) => {
-                gc_list_view!(field, i64, LargeListViewArray)
-            }
-            DataType::Map(field, ordered) => {
-                let map = array.as_map();
-                let entries = ScalarValue::compact_view_buffers(Arc::new(
-                    map.entries().clone(),
-                )
-                    as ArrayRef);
-                Arc::new(MapArray::new(
-                    Arc::clone(field),
-                    map.offsets().clone(),
-                    entries.as_struct().clone(),
-                    map.nulls().cloned(),
-                    *ordered,
-                ))
-            }
-            _ => array,
-        }
-    }
-
$ cargo test -p datafusion-common --lib scalar::tests::test_compact
...
failures:

---- scalar::tests::test_compact_struct_utf8view stdout ----

thread 'scalar::tests::test_compact_struct_utf8view' (76301068) panicked at datafusion/common/src/scalar/mod.rs:10864:9:
assertion `left == right` failed
  left: 1300
 right: 26

---- scalar::tests::test_compact_large_list_view_utf8view stdout ----

thread 'scalar::tests::test_compact_large_list_view_utf8view' (76301064) panicked at datafusion/common/src/scalar/mod.rs:10837:9:
assertion `left == right` failed
  left: 1300
 right: 26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- scalar::tests::test_compact_list_view_utf8view stdout ----

thread 'scalar::tests::test_compact_list_view_utf8view' (76301066) panicked at datafusion/common/src/scalar/mod.rs:10814:9:
assertion `left == right` failed
  left: 1300
 right: 26

---- scalar::tests::test_compact_list_utf8view stdout ----

thread 'scalar::tests::test_compact_list_utf8view' (76301065) panicked at datafusion/common/src/scalar/mod.rs:10745:9:
assertion `left == right` failed
  left: 1300
 right: 26

---- scalar::tests::test_compact_large_list_utf8view stdout ----

thread 'scalar::tests::test_compact_large_list_utf8view' (76301063) panicked at datafusion/common/src/scalar/mod.rs:10768:9:
assertion `left == right` failed
  left: 1300
 right: 26

---- scalar::tests::test_compact_map_utf8view stdout ----

thread 'scalar::tests::test_compact_map_utf8view' (76301067) panicked at datafusion/common/src/scalar/mod.rs:10904:9:
assertion `left == right` failed
  left: 1300
 right: 26

---- scalar::tests::test_compact_fixed_size_list_utf8view stdout ----

thread 'scalar::tests::test_compact_fixed_size_list_utf8view' (76301062) panicked at datafusion/common/src/scalar/mod.rs:10791:9:
assertion `left == right` failed
  left: 1300
 right: 26


failures:
    scalar::tests::test_compact_fixed_size_list_utf8view
    scalar::tests::test_compact_large_list_utf8view
    scalar::tests::test_compact_large_list_view_utf8view
    scalar::tests::test_compact_list_utf8view
    scalar::tests::test_compact_list_view_utf8view
    scalar::tests::test_compact_map_utf8view
    scalar::tests::test_compact_struct_utf8view

test result: FAILED. 0 passed; 7 failed; 0 ignored; 0 measured; 450 filtered out; finished in 0.00s

error: test failed, to rerun pass `-p datafusion-common --lib`

@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 22, 2026

Given I think this is a strict improvement I will merge it in. Hopefully we can implement some of @Jefffrey 's suggestions on cleanups as a follow on PR on main.

In the interim I will make a backport to the releaes-54 branch

Merged via the queue into apache:main with commit f7d5575 May 22, 2026
35 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 22, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate performance Make DataFusion faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ScalarValue::compacted() does not free unused view-buffer memory for Utf8View / BinaryView arrays

4 participants