From 8a288068ca5d9b49199500a839a59698e2e2c49b Mon Sep 17 00:00:00 2001 From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com> Date: Sat, 24 Jan 2026 09:36:53 +0000 Subject: [PATCH] Optimize _format_grouping_output The optimized code achieves a **22% speedup** by adding a fast-path for single DataFrame/Series inputs and avoiding unnecessary data copies during concatenation. ## Key Optimizations 1. **Fast-path for single inputs**: When only one DataFrame or Series is passed, the function now directly calls `reset_index()` instead of invoking `pd.concat()`. This avoids the overhead of pandas' concatenation machinery, which includes index alignment, metadata merging, and internal data structure creation - all unnecessary when there's only one object. 2. **Zero-copy concatenation**: For multiple DataFrames, the optimization adds `copy=False` to `pd.concat()`, which tells pandas to avoid creating unnecessary copies of the underlying data arrays when possible. This reduces both memory allocation overhead and CPU time spent copying data. ## Performance Impact by Test Case The optimization shows **dramatic improvements for single DataFrame cases** (28-85% faster), which represents a common usage pattern: - Single DataFrame tests: 58-85% faster (e.g., `test_single_dataframe_input`: 73.7% faster) - Multiple DataFrame tests: 8-18% faster (more modest but still meaningful) ## Why This Matters Looking at `function_references`, this function is called from `get_mean_grouping()` in a metrics evaluation pipeline. In that context: - The function is called **once per aggregation field** (see the loop `for field in agg_fields`) - For the common case of a single aggregation field, the fast-path optimization directly applies - Even when multiple fields are aggregated, avoiding data copies reduces memory pressure in data-heavy evaluation workflows The optimizations are particularly beneficial when processing evaluation metrics repeatedly across different document types or connectors, as the cumulative time savings add up across multiple invocations. --- unstructured/metrics/utils.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/unstructured/metrics/utils.py b/unstructured/metrics/utils.py index c490aa752b..128df87181 100644 --- a/unstructured/metrics/utils.py +++ b/unstructured/metrics/utils.py @@ -71,7 +71,9 @@ def _format_grouping_output(*df): Concatenates multiple pandas DataFrame objects along the columns (side-by-side) and resets the index. """ - return pd.concat(df, axis=1).reset_index() + if len(df) == 1 and isinstance(df[0], (pd.DataFrame, pd.Series)): + return df[0].reset_index() + return pd.concat(df, axis=1, copy=False).reset_index() def _display(df):