Preliminary support for views during sort_by operations
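
Illustrative note on the nulls-last handling added to `_sorted_positions_from_full_index`: a FULL index orders rows by raw value, so a NaN sentinel lands last and an integer sentinel can land first. The sketch below reproduces the partition step with plain NumPy. It is a minimal stand-in, not the library code: `nulls_last`, the sample arrays, and the `-1` sentinel are hypothetical, and `np.argsort` plays the role of the pre-sorted positions sidecar.

```python
import numpy as np


def nulls_last(positions, values, null_value, ascending=True):
    """Reorder ascending-sorted row positions so that null rows come last.

    `positions` is assumed to list row indices in ascending order of `values`
    (the role the FULL index sidecar plays in this commit).
    """
    if isinstance(null_value, float) and np.isnan(null_value):
        is_null = np.isnan(values[positions])
    else:
        is_null = values[positions] == null_value
    nonnull = positions[~is_null]
    nulls = positions[is_null]
    if not ascending:
        # Only the non-null block is reversed; nulls stay at the end.
        nonnull = nonnull[::-1]
    return np.concatenate([nonnull, nulls])


# Float column with a NaN sentinel: NaN already sorts last in raw order.
values = np.array([3.0, np.nan, 1.0, 2.0, np.nan])
positions = np.argsort(values, kind="stable")
asc = nulls_last(positions, values, float("nan"))
desc = nulls_last(positions, values, float("nan"), ascending=False)

# Integer column with a hypothetical -1 sentinel: it sorts first in raw order.
ints = np.array([5, -1, 7, -1, 6])
int_positions = np.argsort(ints, kind="stable")
int_asc = nulls_last(int_positions, ints, -1)

print(values[asc], values[desc], ints[int_asc])
```

Reversing the whole permutation for a descending sort would move the nulls to the front, which is why the partition happens before the reversal.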

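Illustrative note on the `_run_row_logic` change: slicing an ordered view now carries positions forward instead of building a boolean mask. The NumPy sketch below shows why, under the assumption that a view is described either by a mask over physical rows or by an explicit position array. All names and values here are hypothetical.

```python
import numpy as np

n_physical = 6
sorted_pos = np.array([4, 1, 5, 0, 3, 2])  # hypothetical sort permutation

# Slice the sorted view: rows 1..3 in sorted order.
picked = sorted_pos[1:4]

# Boolean-mask route: mark the picked rows, then read them back.
mask = np.zeros(n_physical, dtype=bool)
mask[picked] = True
via_mask = np.flatnonzero(mask)  # comes back in physical order

# Fancy indexing that repeats a row.
dup = sorted_pos[[0, 0, 2]]
dup_mask = np.zeros(n_physical, dtype=bool)
dup_mask[dup] = True
dup_via_mask = np.flatnonzero(dup_mask)  # the repeated row collapses to one

print(picked, via_mask)
print(dup, dup_via_mask)
```

The position arrays keep both the sorted order and the repeated row; the masks lose both.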
FrancescAlted · FrancescAlted · commit c02b489efd0e · 2026-06-20T09:12:21.000+02:00
diff --git a/doc/getting_started/tutorials/13.ctable-basics.ipynb b/doc/getting_started/tutorials/13.ctable-basics.ipynb
@@ -774,12 +774,7 @@
    "cell_type": "markdown",
    "id": "4f466e5d",
    "metadata": {},
-   "source": [
-    "### 3.3 Sorting\n",
-    "\n",
-    "`sort_by()` returns a sorted copy by default (or sorts in-place with `inplace=True`).\n",
-    "Multi-column sorting is supported — primary key first."
-   ]
+   "source": "### 3.3 Sorting\n\n`sort_by()` returns a sorted copy by default (or sorts in-place with `inplace=True`).\nPass `view=True` for a zero-copy sorted **view** that shares the table's data and gathers\nrows on demand — ideal for reading a sorted slice of a large table without copying it.\nMulti-column sorting is supported — primary key first."
   },
   {
    "cell_type": "code",
@@ -1197,37 +1192,9 @@
      "start_time": "2026-05-21T09:38:01.039615Z"
     }
    },
-   "source": [
-    "# Top 10 hottest days in Madrid across the whole year\n",
-    "# Sort the full table, then filter — views cannot be sorted directly\n",
-    "hottest_all = climate.sort_by(\"temperature\", ascending=False)\n",
-    "madrid_sorted = hottest_all.where(hottest_all.city == \"Madrid\")\n",
-    "print(\"10 hottest days in Madrid:\")\n",
-    "print(madrid_sorted.select([\"city\", \"day\", \"temperature\", \"humidity\"]).head(10))"
-   ],
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "10 hottest days in Madrid:\n",
-      "     city  day  temperature   humidity\n",
-      "0  Madrid  191    31.399208  42.543335\n",
-      "1  Madrid  190    31.232576  44.303246\n",
-      "2  Madrid  227    31.227442  46.992290\n",
-      "3  Madrid  194    30.915184  35.044228\n",
-      "4  Madrid  186    30.879374  48.080303\n",
-      "5  Madrid  202    30.745684  43.722813\n",
-      "6  Madrid  177    30.469023  38.390163\n",
-      "7  Madrid  163    30.215179  46.051888\n",
-      "8  Madrid  181    30.181025  43.726521\n",
-      "9  Madrid  184    29.936199  50.654797\n",
-      "\n",
-      "[10 rows x 4 columns]\n"
-     ]
-    }
-   ],
-   "execution_count": 21
+   "source": "# Top 10 hottest days in Madrid across the whole year.\n# Views *can* be sorted: sort_by() on a where()-view returns a zero-copy sorted\n# view — it shares the table's columns and gathers rows on demand, no full-table\n# copy. (On a base table, pass view=True for the same lazy behaviour.)\nmadrid = climate.where(climate.city == \"Madrid\")\nmadrid_sorted = madrid.sort_by(\"temperature\", ascending=False)\nprint(\"10 hottest days in Madrid:\")\nprint(madrid_sorted.select([\"city\", \"day\", \"temperature\", \"humidity\"]).head(10))",
+   "outputs": [],
+   "execution_count": null
   },
   {
    "cell_type": "markdown",
@@ -2876,30 +2843,7 @@
    "cell_type": "markdown",
    "id": "405cd155",
    "metadata": {},
-   "source": [
-    "---\n",
-    "## Summary\n",
-    "\n",
-    "Here's everything we covered:\n",
-    "\n",
-    "| Feature | API |\n",
-    "|---------|-----|\n",
-    "| Create | `CTable(Schema)`, `CTable(Schema, new_data=...)` |\n",
-    "| Insert | `append(row)`, `extend(list_or_array)` |\n",
-    "| View | `head()`, `tail()`, `print(t)`, `t.info()` |\n",
-    "| Filter | `where(expr)` → view |\n",
-    "| Project | `select([cols])` → view |\n",
-    "| Sort | `sort_by(cols)`, `sort_by(cols, inplace=True)` |\n",
-    "| Aggregates | `col.sum()`, `.mean()`, `.std()`, `.min()`, `.max()` |\n",
-    "| Stats | `describe()`, `cov()` |\n",
-    "| Mutate | `delete()`, `compact()`, `add_column()`, `drop_column()`, `assign()` |\n",
-    "| Persist | `save(path)`, `to_b2z()`, `to_b2d()`, `CTable.open(path)`, `CTable.load(path)` |\n",
-    "| Interop | `to_arrow()`, `from_arrow()`, `to_csv()`, `from_csv()` |\n",
-    "| Nullable | `null_value=` on spec, `is_null()`, `notnull()`, `null_count()` |\n",
-    "\n",
-    "CTable is designed for **compressed analytical workloads** — large tables that need to stay small in RAM\n",
-    "while still being fast to query and easy to persist."
-   ]
+   "source": "---\n## Summary\n\nHere's everything we covered:\n\n| Feature | API |\n|---------|-----|\n| Create | `CTable(Schema)`, `CTable(Schema, new_data=...)` |\n| Insert | `append(row)`, `extend(list_or_array)` |\n| View | `head()`, `tail()`, `print(t)`, `t.info()` |\n| Filter | `where(expr)` → view |\n| Project | `select([cols])` → view |\n| Sort | `sort_by(cols)`, `sort_by(cols, view=True)`, `sort_by(cols, inplace=True)` |\n| Aggregates | `col.sum()`, `.mean()`, `.std()`, `.min()`, `.max()` |\n| Stats | `describe()`, `cov()` |\n| Mutate | `delete()`, `compact()`, `add_column()`, `drop_column()`, `assign()` |\n| Persist | `save(path)`, `to_b2z()`, `to_b2d()`, `CTable.open(path)`, `CTable.load(path)` |\n| Interop | `to_arrow()`, `from_arrow()`, `to_csv()`, `from_csv()` |\n| Nullable | `null_value=` on spec, `is_null()`, `notnull()`, `null_count()` |\n\nCTable is designed for **compressed analytical workloads** — large tables that need to stay small in RAM\nwhile still being fast to query and easy to persist."
   }
  ],
  "metadata": {
diff --git a/src/blosc2/ctable.py b/src/blosc2/ctable.py
@@ -9920,7 +9920,7 @@ def _normalise_sort_keys(
                 )
         return cols, ascending
 
-    def _sorted_positions_from_full_index(self, name: str, ascending: bool) -> np.ndarray | None:
+    def _sorted_positions_from_full_index(self, name: str, ascending: bool) -> np.ndarray | None:  # noqa: C901
         """Return live physical positions from a matching FULL index, if available.
 
         Reads the pre-sorted positions sidecar directly rather than going through
@@ -9931,10 +9931,11 @@ def _sorted_positions_from_full_index(self, name: str, ascending: bool) -> np.nd
         catalog = root._get_index_catalog()
         descriptor = None
 
+        null_value = None
         if name in root._cols:
             col_info = root._schema.columns_by_name.get(name)
-            if col_info is not None and getattr(col_info.spec, "null_value", None) is not None:
-                return None
+            if col_info is not None:
+                null_value = getattr(col_info.spec, "null_value", None)
             descriptor = catalog.get(name)
             if descriptor is None or descriptor.get("kind") != "full" or descriptor.get("stale", False):
                 descriptor = None
@@ -9960,8 +9961,12 @@ def _sorted_positions_from_full_index(self, name: str, ascending: bool) -> np.nd
         # machinery which is built for selective range queries and is ~70x slower
         # for full-table streaming.
         if positions_path is not None:
-            # Persistent table: positions live in a sidecar .b2nd file.
-            positions_nd = blosc2.open(positions_path, mode="r")
+            # Persistent table: positions live in a sidecar .b2nd file.  Use the
+            # sidecar opener so .b2z (zip) stores are read at their zip offset —
+            # blosc2.open() would look for a standalone file that isn't there.
+            from blosc2.indexing import _open_sidecar_file
+
+            positions_nd = _open_sidecar_file(positions_path)
         else:
             # In-memory table: positions live in the sidecar handle cache.
             from blosc2.indexing import _SIDECAR_HANDLE_CACHE, _sidecar_handle_cache_key
@@ -9976,13 +9981,45 @@ def _sorted_positions_from_full_index(self, name: str, ascending: bool) -> np.nd
                 return None
 
         positions = np.asarray(positions_nd[:], dtype=np.int64)
-        valid = root._valid_rows[:]
-        positions = np.asarray(positions, dtype=np.int64)
-        positions = positions[(positions >= 0) & (positions < len(valid))]
-        positions = positions[valid[positions]]
+        total = len(root._valid_rows)
+        # Index sidecars can carry padding positions beyond the live range, so
+        # the bounds check always runs, but the ``.all()`` test skips the copy
+        # (a full-length temporary) when there is nothing to clip.
+        in_bounds = (positions >= 0) & (positions < total)
+        if not bool(in_bounds.all()):
+            positions = positions[in_bounds]
+        del in_bounds
+        # Validity filtering only matters when the table has gaps (deleted rows);
+        # for a compact table every clipped position is already live.
+        if root._n_rows is None or root._n_rows != total:
+            valid = root._valid_rows[:]
+            positions = positions[valid[positions]]
         if self is not root:
             current_valid = self._valid_rows[:]
             positions = positions[current_valid[positions]]
+
+        if null_value is not None:
+            # The index sorts by raw value, but sort_by's contract is nulls-last.
+            # Partition explicitly so it holds for any sentinel (NaN sorts last,
+            # an integer sentinel like INT64_MIN sorts first) and either order.
+            # Free each full-length temporary as soon as it is consumed to keep
+            # peak memory near the size of the permutation itself.
+            raw = np.asarray(root._cols[name][:])
+            if isinstance(null_value, float) and np.isnan(null_value):
+                null_phys = np.isnan(raw)
+            else:
+                null_phys = raw == null_value
+            del raw
+            if null_phys.any():
+                is_null = null_phys[positions]
+                del null_phys
+                nulls = positions[is_null]
+                nonnull = positions[~is_null]
+                del is_null, positions
+                if not ascending:
+                    nonnull = nonnull[::-1]
+                return np.concatenate([nonnull, nulls])
+
         if not ascending:
             positions = positions[::-1]
         return positions
@@ -10047,8 +10084,15 @@ def sort_by(
         ascending: bool | list[bool] = True,
         *,
         inplace: bool = False,
+        view: bool = False,
     ) -> CTable:
-        """Return a copy of the table sorted by one or more columns.
+        """Return the table sorted by one or more columns.
+
+        By default this materialises a new in-memory copy of the sorted rows.
+        Pass ``view=True`` to instead get a lightweight **sorted view** that
+        shares the parent's column data and gathers rows on demand in sorted
+        order — no whole-table copy.  This is ideal for reading a sorted slice
+        of a large persistent table (e.g. ``t.sort_by("col", view=True)[:10]``).
 
         Parameters
         ----------
@@ -10069,17 +10113,31 @@ def sort_by(
             ``self`` (like :meth:`compact` but sorted).  If ``False``
             (default), return a new in-memory CTable leaving this one
             untouched.
+        view:
+            If ``True``, return a zero-copy sorted **view** over this table
+            instead of materialising a copy: it shares the parent's columns and
+            stores only the sort permutation, gathering rows on demand in sorted
+            order.  Slicing the view (``sv[start:stop:step]``) keeps the sorted
+            order and touches only the rows read.  A single-column sort backed by
+            a non-stale ``FULL`` index reuses its pre-sorted positions (no sort at
+            read time); otherwise only the sort-key column(s) are materialised to
+            build the permutation — never the whole table.  Mutually exclusive
+            with ``inplace``.  Sorting an existing view is always lazy regardless
+            of this flag.
 
         Raises
         ------
         ValueError
-            If called on a view or a read-only table when ``inplace=True``.
+            If called on a view or a read-only table when ``inplace=True``, or if
+            both ``inplace`` and ``view`` are ``True``.
         KeyError
             If any column name is not found.
         TypeError
             If a column used as a sort key does not support ordering
             (e.g. complex numbers).
         """
+        if inplace and view:
+            raise ValueError("inplace=True and view=True are mutually exclusive.")
         if self.base is not None and inplace:
             raise ValueError(
                 "Cannot sort a view inplace (would modify shared column data). Use sort_by(inplace=False) to get a sorted copy."
@@ -10120,7 +10178,7 @@ def sort_by(
         # use those positions directly, so columns are fetched on demand and in
         # the correct sorted order — identical performance to pre-projecting
         # with columns= before calling sort_by.
-        if self.base is not None:
+        if self.base is not None or view:
             result = CTable._make_view(self, self._valid_rows)
             result._cached_live_positions = sorted_pos
             result._n_rows = n
@@ -11332,6 +11390,12 @@ def _run_row_logic(self, ind: int | slice | str | Iterable) -> CTable:
 
         mant_pos = true_pos[ind]
 
+        # For an ordered view (sorted view or position view), preserve the row
+        # order and any duplicates by carrying the positions forward.  A boolean
+        # mask is physical-order and set-like, so it would silently drop both.
+        if getattr(self, "_cached_live_positions", None) is not None:
+            return self._view_from_positions(np.asarray(mant_pos))
+
         new_mask_np = np.zeros(len(self._valid_rows), dtype=bool)
         new_mask_np[mant_pos] = True
 
diff --git a/tests/ctable/test_sort_by.py b/tests/ctable/test_sort_by.py
@@ -414,5 +414,78 @@ def test_sort_unprojected_view_opens_only_needed_columns(tmp_path):
         t.close()
 
 
+def test_sort_view_zero_copy_slice(tmp_path):
+    """sort_by(view=True) returns a zero-copy view whose slices keep sorted order."""
+    rng = np.random.default_rng(0)
+    n = 1000
+    score = rng.integers(0, 50, n).astype(np.float64)  # duplicates on purpose
+    ids = np.arange(n)
+    data = list(zip(ids.tolist(), score.tolist(), [True] * n, strict=True))
+
+    urlpath = str(tmp_path / "sort-view.b2z")
+    t = CTable(Row, new_data=data, urlpath=urlpath, mode="w")
+    t.create_index("id", kind=blosc2.IndexKind.FULL)  # id has a FULL index
+
+    sv = t.sort_by("score", view=True)
+    assert sv.base is not None  # a view, not a materialised copy
+
+    order = np.argsort(score, kind="stable")
+    for sl in [slice(0, 10), slice(-10, None), slice(None, None, 2), slice(100, 50, -1), slice(5, 25, 3)]:
+        np.testing.assert_array_equal(np.asarray(sv[sl]["score"][:]), score[order][sl])
+
+    # Descending, and a FULL-index-backed single-column sort, both stay ordered.
+    svd = t.sort_by("score", ascending=False, view=True)
+    np.testing.assert_array_equal(np.asarray(svd[:10]["score"][:]), score[order[::-1]][:10])
+    svf = t.sort_by("id", view=True)
+    np.testing.assert_array_equal(np.asarray(svf[:10]["id"][:]), np.arange(10))
+
+
+@pytest.mark.parametrize("ascending", [True, False])
+def test_sort_view_full_index_nullable_persistent(tmp_path, ascending):
+    """A FULL index on a nullable column accelerates sort_by(view=True) on a .b2z,
+    and the result keeps nulls last (matching the materialised copy path)."""
+
+    @dataclass
+    class NullRow:
+        key: int = blosc2.field(blosc2.int64(ge=0))
+        val: float = blosc2.field(blosc2.float64(null_value=float("nan")), default=float("nan"))
+
+    rng = np.random.default_rng(1)
+    n = 2000
+    val = rng.integers(0, 100, n).astype(np.float64)
+    val[rng.choice(n, 50, replace=False)] = np.nan  # scattered nulls
+    data = list(zip(range(n), val.tolist(), strict=True))
+
+    urlpath = str(tmp_path / "nullable.b2z")
+    t = CTable(NullRow, new_data=data, urlpath=urlpath, mode="w")
+    t.create_index("val", kind=blosc2.IndexKind.FULL)
+    t.close()
+
+    t = blosc2.CTable.open(urlpath, mode="r")
+    try:
+        # Reference: copy path (its nulls-last behaviour is the contract).
+        ref = np.asarray(t.sort_by("val", ascending=ascending)["val"][:])
+        got = np.asarray(t.sort_by("val", ascending=ascending, view=True)["val"][:])
+        np.testing.assert_array_equal(got, ref)  # NaNs at matching positions compare equal
+        # Nulls must be last regardless of direction.
+        assert np.isnan(got[-50:]).all()
+        assert not np.isnan(got[:-50]).any()
+    finally:
+        t.close()
+
+
+def test_sort_view_false_returns_copy():
+    """The default (view=False) still returns an independent in-memory copy."""
+    t = CTable(Row, new_data=DATA)
+    cp = t.sort_by("score")
+    assert cp.base is None
+
+
+def test_sort_view_inplace_mutually_exclusive():
+    t = CTable(Row, new_data=DATA)
+    with pytest.raises(ValueError, match="mutually exclusive"):
+        t.sort_by("score", inplace=True, view=True)
+
+
 if __name__ == "__main__":
     pytest.main(["-v", __file__])