From d03f6e6a8354890b3320c5f05e3ba1d8214b3905 Mon Sep 17 00:00:00 2001
From: Tian Yuchen <114374394+yuchen814@users.noreply.github.com>
Date: Mon, 29 Jun 2026 11:43:36 +0800
Subject: [PATCH] Add datasets regression test suite (QA)

---
 tests/regression/regression_suite.md | 1071 ++++++++++++++++++++++++++
 1 file changed, 1071 insertions(+)
 create mode 100644 tests/regression/regression_suite.md

diff --git a/tests/regression/regression_suite.md b/tests/regression/regression_suite.md
new file mode 100644
index 00000000000..772b3490563
--- /dev/null
+++ b/tests/regression/regression_suite.md
@@ -0,0 +1,1071 @@
+# `datasets` Regression Test Suite
+
+**Target repository:** `huggingface/datasets`
+**Author:** QA Engineering
+**Date:** 2026-06-29
+**Scope source:** `qa_brief.yaml`
+
+---
+
+## 1. Overview
+
+This document defines a manual/semi-automated regression test suite for the Hugging Face
+`datasets` library. It was built by:
+
+1. **Mining historical issue data** (`github-bug-reports-and-issues.jsonl`) — a snapshot of
+   **984 issues** (3019 records incl. 2035 PRs) for `huggingface/datasets`, spanning
+   **2020-04 → 2021-09**, to find the bug categories that *keep repeating*.
+2. **Cross-checking the live repository** via the GitHub Search API (June 2026) to confirm
+   which of those categories are *still* generating open issues today (current issue numbers
+   reach **#8269**), and to surface coverage gaps.
+3. **Mapping the result onto the `qa_brief.yaml` focus areas** to produce a prioritized suite
+   split into a **30-minute smoke pass** and a **180-minute full regression run** with a
+   minimum of **120 test cases** (this suite contains **128**).
+
+### 1.1 Budgets & constraints (from `qa_brief.yaml`)
+
+| Constraint | Value |
+|---|---|
+| Smoke pass budget | 30 minutes |
+| Full regression budget | 180 minutes |
+| Minimum test cases | 120 (**this suite: 128**) |
+| Priority labels | `bug`, `regression`, `streaming`, `cache`, `arrow`, `data-loading`, `critical`, `flaky` |
+
+### 1.2 Focus areas (from `qa_brief.yaml`)
+
+`data loading` · `streaming` · `caching` · `arrow serialization` ·
+`dataset splits and slicing` · `map and filter transformations` ·
+`multiprocessing and parallelism` · `trust_remote_code handling`
+
+---
+
+## 2. Data-driven category analysis
+
+### 2.1 Historical bug volume by category (984 issues, 2020-04 → 2021-09)
+
+Issues were classified by keyword/label matching against the focus areas. An issue may match
+more than one category (e.g. a `num_proc` crash inside `.map()` counts under both
+*map/filter* and *multiprocessing*), so column totals exceed 984.
+
+| Rank | Category | Total issues | Open (snapshot) | Closed | % of issues |
+|---|---|---:|---:|---:|---:|
+| 1 | Data loading | 472 | 147 | 325 | 48% |
+| 2 | Caching | 287 | 77 | 210 | 29% |
+| 3 | Splits & slicing | 282 | 89 | 193 | 29% |
+| 4 | Arrow serialization | 253 | 78 | 175 | 26% |
+| 5 | Encoding/decoding (features) | 239 | 85 | 154 | 24% |
+| 6 | Map & filter | 220 | 63 | 157 | 22% |
+| 7 | Multiprocessing & parallelism | 153 | 46 | 107 | 16% |
+| 8 | Memory / OOM | 49 | 20 | 29 | 5% |
+| 9 | Streaming | 40 | 22 | 18 | 4% |
+
+**Reading the table:** Historically, **data loading** is the largest driver of bugs by a wide
+margin (~half of all issues touch it), followed by a tight cluster of **caching, splits/slicing,
+and arrow serialization**. Streaming had *low absolute volume* in 2020-21, but it was a
+*new feature* then, which is the classic profile of a category that grows later (see §2.3).
+
+### 2.2 Live cross-check — currently open issues (GitHub Search API, June 2026)
+
+Queried directly against `repo:huggingface/datasets is:issue is:open`. Current data
+(issue numbers up to #8269) confirms the historical categories are still active.
+
+| Category (live query) | Open issues now | Representative current issue |
+|---|---:|---|
+| **All open issues** | **893** | — |
+| **All open `label:bug`** | **97** | #7037 `Dataset.to_json()` bug; #6937 JSON loader coerces floats→ints |
+| Splits & slicing | 332 | #8269 MetadataConfigs drops parquet shards; #4526 split cache reused across splits |
+| Map & filter | 237 | #5870 `map` behaviour differs Dataset vs IterableDataset; #6787 `TimeoutError in map` |
+| Multiprocessing | 127 | #6936 `save_to_disk()` freezes on s3 with multiprocessing; #6357 pass mp context |
+| trust_remote_code | 36 | #7723 "Don't remove `trust_remote_code`!"; #7692 xopen invalid start byte w/ streaming + remote code |
+
+> **Note on `data loading`, `caching`, `arrow`, `streaming` live counts:** the live search
+> queries for these four returned successfully but their issue bodies contained control
+> characters that broke strict JSON parsing during collection, so exact live counts are not
+> quoted here. Given they are the top historical categories and the library's core surface,
+> they are treated as **high priority** regardless. This is flagged as a known data-collection
+> limitation, not an assumption that they are low-risk.
+
+### 2.3 Key findings
+
+1. **Data loading is the #1 historical bug driver (48% of issues)** and remains the core
+   surface of the library — it gets the largest share of test cases.
+2. **The 2020-21 patterns persist in 2026.** `num_proc` hangs/freezes (#6936), `map`
+   semantics (#5870, #6787), splits/sharding (#8269, #4526), and serialization (#7037, #6937)
+   are *still* open, so a suite built around the historical categories remains valid.
+3. **`trust_remote_code` is a newer, contentious surface** (36 open, incl. #7723/#7692). It
+   barely registered historically because the argument did not exist in 2020-21 — a clear
+   **coverage gap** that this suite explicitly fills (see §5, TC-TRC-*).
+4. **Streaming is under-represented historically but high-growth.** Low 2020-21 volume + an
+   actively maturing feature ⇒ we deliberately *over-weight* streaming relative to its
+   historical count to avoid a coverage gap.
+
+### 2.4 Coverage-gap summary
+
+| Gap | Evidence | Suite response |
+|---|---|---|
+| `trust_remote_code` security/behaviour | Did not exist in 2020-21 snapshot; 36 open today | New dedicated section TC-TRC-01..10 |
+| Streaming maturity | Only 40 historical issues but feature still growing; #7692 ties streaming+remote code | Over-weighted: TC-STR-01..18 |
+| Multiprocessing hangs/freezes | #6936 (s3 freeze), #6357 still open; classic flaky class | Dedicated flaky-tagged section TC-MP-01..14 |
+| Cross-cutting "behaviour differs by mode" | #5870 (map differs Dataset vs Iterable), #2923/#2866 (errors differ streaming vs normal) | Parity test cases TC-XCUT-* |
+
+---
+
+## 3. Test suite structure & conventions
+
+### 3.1 Test case ID scheme
+
+`TC-<AREA>-<NN>` — area codes:
+
+| Code | Area |
+|---|---|
+| `DL` | Data loading |
+| `STR` | Streaming |
+| `CACHE` | Caching |
+| `ARROW` | Arrow serialization |
+| `SPLIT` | Splits & slicing |
+| `MAP` | Map & filter transformations |
+| `MP` | Multiprocessing & parallelism |
+| `TRC` | trust_remote_code handling |
+| `XCUT` | Cross-cutting / parity |
+
+### 3.2 Priority levels
+
+- **P0 – Critical:** core paths; failure blocks release. All P0 cases are in the smoke pass.
+- **P1 – High:** common real-world usage; full regression only.
+- **P2 – Medium:** edge cases / less-common configurations; full regression only.
+
+### 3.3 Tag legend
+
+Tags map to `qa_brief.yaml` `priority_labels`: `bug`, `regression`, `streaming`, `cache`,
+`arrow`, `data-loading`, `critical`, `flaky`. Cases also use the area tags `map` and `split`.
+
+### 3.4 Standard preconditions (apply to all cases unless overridden)
+
+- **ENV-1:** Clean Python venv with a single pinned `datasets` version under test, plus
+  `pyarrow`, `pandas`, `numpy`.
+- **ENV-2:** Cache dir reset before each case: `HF_DATASETS_CACHE` pointed at an empty temp
+  directory (prevents stale-cache cross-talk; see TC-CACHE-*).
+- **ENV-3:** Network available for hub-backed cases; offline cases set `HF_DATASETS_OFFLINE=1`.
+- **ENV-4:** Fixtures (small local CSV/JSON/Parquet/text files and a tiny loading script)
+  staged under `./fixtures/`.
+
+### 3.5 Test-run profiles
+
+| Profile | Budget | Cases | Selection rule |
+|---|---|---|---|
+| **Smoke** | 30 min | **24** P0 cases | Two to six canary cases per focus area plus one cross-cutting case; happy paths + top regressions only |
+| **Full regression** | 180 min | **128** (all) | Every case below |
+
+---
+
+## 4. Smoke pass (P0, ≤30 min, 24 cases)
+
+Run in this order; abort the release pipeline on any failure.
+
+| # | TC ID | Area | Title |
+|---|---|---|---|
+| 1 | TC-DL-01 | Data loading | Load a Hub dataset by name (happy path) |
+| 2 | TC-DL-02 | Data loading | Load local CSV via `load_dataset("csv", ...)` |
+| 3 | TC-DL-03 | Data loading | Load local JSON / JSONL |
+| 4 | TC-DL-05 | Data loading | `load_dataset` with explicit `split=` |
+| 5 | TC-DL-09 | Data loading | Checksum / split-size verification passes |
+| 6 | TC-STR-01 | Streaming | `streaming=True` returns an `IterableDataset` and iterates |
+| 7 | TC-STR-03 | Streaming | Streamed vs non-streamed first-row parity |
+| 8 | TC-CACHE-01 | Caching | Second load reuses cache (no re-download) |
+| 9 | TC-CACHE-04 | Caching | `download_mode="force_redownload"` bypasses cache |
+| 10 | TC-ARROW-01 | Arrow | Round-trip `save_to_disk` / `load_from_disk` |
+| 11 | TC-ARROW-03 | Arrow | Load Parquet file preserves schema |
+| 12 | TC-SPLIT-01 | Splits | `train_test_split` produces non-empty splits |
+| 13 | TC-SPLIT-04 | Splits | Percentage slicing `split="train[:10%]"` |
+| 14 | TC-SPLIT-07 | Splits | `select()` returns rows and supports `.shape` |
+| 15 | TC-MAP-01 | Map/filter | `map()` adds a column (single process) |
+| 16 | TC-MAP-04 | Map/filter | `filter()` returns correct subset |
+| 17 | TC-MAP-07 | Map/filter | `map(batched=True)` happy path |
+| 18 | TC-MP-01 | Multiprocessing | `map(num_proc=4)` completes & matches single-proc |
+| 19 | TC-MP-05 | Multiprocessing | `filter(num_proc>1)` completes without hang |
+| 20 | TC-TRC-01 | trust_remote_code | Script dataset blocked without `trust_remote_code` |
+| 21 | TC-TRC-02 | trust_remote_code | Script dataset loads with `trust_remote_code=True` |
+| 22 | TC-XCUT-01 | Cross-cutting | Same dataset loads in normal AND streaming mode |
+| 23 | TC-ARROW-06 | Arrow | `to_pandas()` / `to_dict()` round-trip values |
+| 24 | TC-DL-12 | Data loading | Offline mode loads already-cached dataset |
+
+---
+
+## 5. Full regression test cases
+
+> Format for each case: **Priority · Tags · Preconditions · Steps · Expected results.**
+> `[SMOKE]` marks cases also in the 30-minute pass.
+
+### 5.1 Data loading (TC-DL) — *largest category: 48% of historical bugs*
+
+---
+
+#### TC-DL-01 — Load a Hub dataset by name (happy path) `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading, critical
+- **Preconditions:** ENV-1..4; network up; pick a small canonical dataset (e.g. `rotten_tomatoes`).
+- **Steps:**
+  1. Call `load_dataset("rotten_tomatoes")`.
+  2. Inspect the returned object type and keys.
+  3. Read `ds["train"][0]`.
+- **Expected:** Returns a `DatasetDict` with `train`/`validation`/`test`; each is a `Dataset`;
+  first row is a dict with the documented features; no exception.
+
+#### TC-DL-02 — Load local CSV `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading
+- **Preconditions:** `./fixtures/data.csv` with header + ≥3 rows.
+- **Steps:**
+  1. `load_dataset("csv", data_files="./fixtures/data.csv")`.
+  2. Check `ds["train"].column_names` and row count.
+- **Expected:** Columns match CSV header; row count equals data rows; dtypes inferred correctly.
+
+#### TC-DL-03 — Load local JSON / JSONL `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading
+- **Preconditions:** `./fixtures/data.jsonl` (one JSON object per line).
+- **Steps:**
+  1. `load_dataset("json", data_files="./fixtures/data.jsonl")`.
+  2. Verify nested fields are preserved.
+- **Expected:** Loads without `ArrowInvalid`; nested structures preserved; counts correct.
+
+#### TC-DL-04 — JSON numeric type coercion (regression of #6937)
+- **Priority:** P1 · **Tags:** data-loading, bug, regression
+- **Preconditions:** JSONL where a field has mixed `1`, `2.5` values.
+- **Steps:**
+  1. Load with the JSON builder.
+  2. Inspect the resolved feature dtype and the actual loaded values.
+- **Expected:** Floats are NOT silently coerced to integers; the value `2.5` is preserved
+  (guards regression #6937).
+
+#### TC-DL-05 — `load_dataset` with explicit `split=` `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading
+- **Preconditions:** ENV-1..4; dataset with multiple splits.
+- **Steps:**
+  1. `load_dataset(name, split="train")`.
+  2. Confirm a single `Dataset` (not a `DatasetDict`) is returned.
+- **Expected:** Returns `Dataset` object for the requested split only.
+
+#### TC-DL-06 — Load multiple `data_files` mapped to splits
+- **Priority:** P1 · **Tags:** data-loading
+- **Preconditions:** `train.csv`, `test.csv` fixtures.
+- **Steps:**
+  1. `load_dataset("csv", data_files={"train":"train.csv","test":"test.csv"})`.
+- **Expected:** `DatasetDict` with exactly `train` and `test`; correct per-split counts.
+
+#### TC-DL-07 — Load with a named config / subset
+- **Priority:** P1 · **Tags:** data-loading
+- **Preconditions:** A dataset exposing multiple configs (e.g. `glue`, config `mrpc`).
+- **Steps:**
+  1. `load_dataset("glue", "mrpc")`.
+- **Expected:** Correct config-specific features and splits load.
+
+#### TC-DL-08 — Missing config raises actionable error
+- **Priority:** P2 · **Tags:** data-loading
+- **Preconditions:** Multi-config dataset.
+- **Steps:**
+  1. Call `load_dataset` with a non-existent config name.
+- **Expected:** Raises `ValueError` listing available configs; message is human-readable.
+
+#### TC-DL-09 — Checksum / split-size verification passes `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading, critical, regression
+- **Preconditions:** A dataset with recorded `dataset_infos`/metadata.
+- **Steps:**
+  1. Load the dataset normally.
+  2. Observe verification of number of examples and checksums.
+- **Expected:** No `NonMatchingSplitsSizesError` / `NonMatchingChecksumError` on a healthy
+  dataset (guards regression class #2941, #2882).
+
+#### TC-DL-10 — `verification_mode="no_checks"` skips verification
+- **Priority:** P2 · **Tags:** data-loading
+- **Preconditions:** Dataset whose recorded sizes intentionally mismatch (fixture).
+- **Steps:**
+  1. Load with `verification_mode="no_checks"`.
+- **Expected:** Load succeeds without raising; documents the escape hatch behaviour.
+
+#### TC-DL-11 — Load Parquet directly
+- **Priority:** P1 · **Tags:** data-loading, arrow
+- **Preconditions:** `./fixtures/data.parquet`.
+- **Steps:**
+  1. `load_dataset("parquet", data_files="./fixtures/data.parquet")`.
+- **Expected:** Schema and values preserved; no exception.
+
+#### TC-DL-12 — Offline mode loads cached dataset `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading, cache
+- **Preconditions:** Dataset previously loaded into cache; then set `HF_DATASETS_OFFLINE=1`.
+- **Steps:**
+  1. Re-run `load_dataset(name)` with offline flag set.
+- **Expected:** Loads from cache with no network call; no hang/timeout.
+
+#### TC-DL-13 — Offline mode without cache fails fast
+- **Priority:** P2 · **Tags:** data-loading
+- **Preconditions:** Empty cache; `HF_DATASETS_OFFLINE=1`.
+- **Steps:**
+  1. Attempt to load an uncached dataset.
+- **Expected:** Raises a clear offline error promptly (no long network retry/hang).
+
+#### TC-DL-14 — Load text file dataset
+- **Priority:** P1 · **Tags:** data-loading
+- **Preconditions:** `./fixtures/corpus.txt`.
+- **Steps:**
+  1. `load_dataset("text", data_files="./fixtures/corpus.txt")`.
+- **Expected:** One row per line; `text` column present.
+
+#### TC-DL-15 — Load from in-memory pandas / dict
+- **Priority:** P1 · **Tags:** data-loading
+- **Steps:**
+  1. `Dataset.from_pandas(df)` and `Dataset.from_dict({...})`.
+- **Expected:** Correct features inferred; counts match the source.
+
+#### TC-DL-16 — `pathlib.Path` accepted for paths (regression of #6829)
+- **Priority:** P1 · **Tags:** data-loading, bug, regression
+- **Preconditions:** Fixture path as a `pathlib.Path`.
+- **Steps:**
+  1. Pass a `Path` object to `load_dataset(..., data_files=Path(...))` and to `save_to_disk`.
+- **Expected:** `Path` objects accepted (guards regression #6829, where Path stopped working).
+
+#### TC-DL-17 — gzip / compressed content decoding (regression of #2918)
+- **Priority:** P2 · **Tags:** data-loading, bug
+- **Preconditions:** gzip-compressed JSON/CSV fixture.
+- **Steps:**
+  1. Load the compressed file.
+- **Expected:** Decompresses and loads correctly; no "Can not decode content-encoding: gzip".
+
+#### TC-DL-18 — `data_dir` directory globbing
+- **Priority:** P2 · **Tags:** data-loading
+- **Preconditions:** Directory of multiple CSVs.
+- **Steps:**
+  1. `load_dataset("csv", data_dir="./fixtures/csvs/")`.
+- **Expected:** All matching files concatenated into one dataset; total count correct.
+
+### 5.2 Streaming (TC-STR) — *over-weighted: high-growth, gap-filled*
+
+---
+
+#### TC-STR-01 — `streaming=True` yields `IterableDataset` `[SMOKE]`
+- **Priority:** P0 · **Tags:** streaming, critical
+- **Steps:**
+  1. `load_dataset(name, split="train", streaming=True)`.
+  2. `next(iter(ds))`.
+- **Expected:** Object is an `IterableDataset`; first element is a feature dict; no full download.
+
+#### TC-STR-02 — Streaming local files
+- **Priority:** P1 · **Tags:** streaming, data-loading
+- **Steps:**
+  1. Stream a local CSV/JSON fixture with `streaming=True`.
+- **Expected:** Iterates rows lazily; values correct.
+
+#### TC-STR-03 — Streamed vs non-streamed first-row parity `[SMOKE]`
+- **Priority:** P0 · **Tags:** streaming, regression
+- **Steps:**
+  1. Load dataset normally and streaming.
+  2. Compare first N rows.
+- **Expected:** Identical content and feature schema between modes (guards #2923/#2866 class).
+
+#### TC-STR-04 — `IterableDataset.map()` lazy transform
+- **Priority:** P1 · **Tags:** streaming, map
+- **Steps:**
+  1. Apply `.map(fn)` to a streamed dataset; iterate.
+- **Expected:** Transform applied lazily per element; result matches eager `.map`.
+
+#### TC-STR-05 — `IterableDataset.filter()`
+- **Priority:** P1 · **Tags:** streaming, map
+- **Steps:**
+  1. `.filter(pred)` on streamed dataset; iterate.
+- **Expected:** Only matching rows yielded.
+
+#### TC-STR-06 — `remove_columns` on `IterableDataset` (regression of #2944)
+- **Priority:** P1 · **Tags:** streaming, regression
+- **Steps:**
+  1. `streamed.remove_columns([...])`; iterate.
+- **Expected:** Listed columns absent from yielded rows (feature added in #2944).
+
+#### TC-STR-07 — `take()` / `skip()` on streamed dataset
+- **Priority:** P1 · **Tags:** streaming
+- **Steps:**
+  1. `ds.take(5)` and `ds.skip(5)`; iterate both.
+- **Expected:** `take` yields first 5; `skip` yields from the 6th onward; no overlap.
+
+#### TC-STR-08 — Streaming `shuffle(buffer_size=...)`
+- **Priority:** P2 · **Tags:** streaming, flaky
+- **Steps:**
+  1. `ds.shuffle(seed=42, buffer_size=100)`; iterate.
+- **Expected:** Order differs from unshuffled; with fixed seed the order is reproducible.
+
+#### TC-STR-09 — Stream sharded dataset across files
+- **Priority:** P2 · **Tags:** streaming
+- **Steps:**
+  1. Stream a multi-shard dataset.
+- **Expected:** All shards traversed; total iterated count equals sum of shard counts.
+
+#### TC-STR-10 — Convert streamed → `torch` iterable for DataLoader
+- **Priority:** P2 · **Tags:** streaming
+- **Steps:**
+  1. `ds.with_format("torch")`; wrap in a `torch.utils.data.DataLoader`.
+- **Expected:** Batches yield tensors; no crash.
+
+#### TC-STR-11 — Streaming respects `features=` override
+- **Priority:** P2 · **Tags:** streaming, arrow
+- **Steps:**
+  1. Stream with an explicit `Features` schema.
+- **Expected:** Yielded rows cast to the declared schema.
+
+#### TC-STR-12 — Streaming over gzip/remote compressed shards
+- **Priority:** P2 · **Tags:** streaming, bug
+- **Steps:**
+  1. Stream a gzip-compressed remote/local shard.
+- **Expected:** Transparent decompression; rows correct (relates to #2918).
+
+#### TC-STR-13 — Streamed `IterableDatasetDict` per-split access
+- **Priority:** P2 · **Tags:** streaming
+- **Steps:**
+  1. `load_dataset(name, streaming=True)` (no split); access `["train"]`.
+- **Expected:** Returns `IterableDatasetDict`; each split iterable independently.
+
+#### TC-STR-14 — `rename_column` on streamed dataset
+- **Priority:** P2 · **Tags:** streaming
+- **Steps:**
+  1. `streamed.rename_column("a","b")`; iterate.
+- **Expected:** Column renamed in yielded rows.
+
+#### TC-STR-15 — `cast_column` on streamed dataset
+- **Priority:** P2 · **Tags:** streaming, arrow
+- **Steps:**
+  1. `streamed.cast_column("label", ClassLabel(...))`; iterate.
+- **Expected:** Values cast lazily; no error.
+
+#### TC-STR-16 — Interrupt/resume streaming iteration
+- **Priority:** P2 · **Tags:** streaming, flaky
+- **Steps:**
+  1. Begin iterating, stop after K, create a fresh iterator.
+- **Expected:** Fresh iterator restarts from the beginning deterministically.
+
+#### TC-STR-17 — Streaming + `trust_remote_code` script dataset (regression of #7692)
+- **Priority:** P1 · **Tags:** streaming, bug, regression
+- **Preconditions:** A script-based dataset that supports streaming.
+- **Steps:**
+  1. `load_dataset(script, streaming=True, trust_remote_code=True)`; iterate.
+- **Expected:** Streams without "invalid start byte"/decode error (guards #7692).
+
+#### TC-STR-18 — Empty streamed dataset handled
+- **Priority:** P2 · **Tags:** streaming
+- **Steps:**
+  1. Stream a dataset/filter that yields zero rows.
+- **Expected:** Iterator simply ends; no exception.
+
+### 5.3 Caching (TC-CACHE) — *2nd largest: 29% of historical bugs*
+
+---
+
+#### TC-CACHE-01 — Second load reuses cache `[SMOKE]`
+- **Priority:** P0 · **Tags:** cache, critical
+- **Steps:**
+  1. Load dataset (cold), then load again (warm), timing both.
+- **Expected:** Second load reads from cache (no re-download); markedly faster.
+
+#### TC-CACHE-02 — Cache survives across processes
+- **Priority:** P1 · **Tags:** cache
+- **Steps:**
+  1. Load in process A; load again in a fresh process B (same `HF_DATASETS_CACHE`).
+- **Expected:** Process B reuses the cache produced by A.
+
+#### TC-CACHE-03 — Distinct `cache_dir` isolates caches
+- **Priority:** P1 · **Tags:** cache
+- **Steps:**
+  1. Load with `cache_dir=A`, then `cache_dir=B`.
+- **Expected:** Two independent cache trees; neither pollutes the other.
+
+#### TC-CACHE-04 — `force_redownload` bypasses cache `[SMOKE]`
+- **Priority:** P0 · **Tags:** cache, regression
+- **Steps:**
+  1. Load (cache populated), then load again with `download_mode="force_redownload"`.
+- **Expected:** Data re-downloaded/re-prepared; works as documented (guards FORCE_REDOWNLOAD #2904).
+
+#### TC-CACHE-05 — `reuse_dataset_if_exists` default
+- **Priority:** P1 · **Tags:** cache
+- **Steps:**
+  1. Load twice with default download mode.
+- **Expected:** Prepared dataset reused; no rebuild.
+
+#### TC-CACHE-06 — Cache reused after moving cache directory (regression of #2496)
+- **Priority:** P1 · **Tags:** cache, bug, regression
+- **Steps:**
+  1. Build cache in dir A, move dir A → dir B, point `HF_DATASETS_CACHE` to B, then reload.
+- **Expected:** Cache still valid (fingerprint not invalidated by path change) — guards #2496.
+
+#### TC-CACHE-07 — `map` results cached and reused
+- **Priority:** P1 · **Tags:** cache, map
+- **Steps:**
+  1. Run `.map(fn)`; re-run identical `.map(fn)`.
+- **Expected:** Second run loads cached arrow file (fingerprint match); function not re-executed.
+
+#### TC-CACHE-08 — Different `map` fn ⇒ new cache (fingerprint changes)
+- **Priority:** P1 · **Tags:** cache, map
+- **Steps:**
+  1. `.map(fn1)` then `.map(fn2)` (different logic).
+- **Expected:** Distinct fingerprints; fn2 actually executed (no stale reuse).
+
+#### TC-CACHE-09 — `load_from_cache_file=False` forces recompute
+- **Priority:** P2 · **Tags:** cache, map
+- **Steps:**
+  1. `.map(fn, load_from_cache_file=False)` twice.
+- **Expected:** Function re-runs both times.
+
+#### TC-CACHE-10 — Deterministic fingerprint across runs (regression of #2775)
+- **Priority:** P2 · **Tags:** cache, flaky, regression
+- **Steps:**
+  1. In two fresh processes, build the same dataset+map; compare `_fingerprint`.
+- **Expected:** Fingerprints identical & deterministic (guards #2775 non-deterministic fingerprint).
+
+#### TC-CACHE-11 — `cleanup_cache_files()` removes stale files
+- **Priority:** P2 · **Tags:** cache
+- **Steps:**
+  1. Generate several cached map files; call `ds.cleanup_cache_files()`.
+- **Expected:** Returns count removed; stale files deleted; current dataset still usable.
+
+#### TC-CACHE-12 — Windows-style path / permission handling (regression of #2937)
+- **Priority:** P2 · **Tags:** cache, bug
+- **Preconditions:** Run on Windows or simulate locked-file scenario.
+- **Steps:**
+  1. Load dataset to default cache; ensure temp files are released.
+- **Expected:** No `PermissionError` on cache files (guards #2937, #2471).
+
+#### TC-CACHE-13 — Per-split cache isolation (regression of #4526)
+- **Priority:** P1 · **Tags:** cache, split, bug, regression
+- **Steps:**
+  1. Process `train` split with `.map`; then process `test` split with same fn.
+- **Expected:** `test` is NOT served the `train` cache; outputs differ correctly (guards #4526).
+
+#### TC-CACHE-14 — Disk-space / cache growth sanity (relates to #2591)
+- **Priority:** P2 · **Tags:** cache, flaky
+- **Steps:**
+  1. Run repeated `.map` operations; monitor cache dir size.
+- **Expected:** No unbounded growth from a single repeated identical op (cache reused, not duplicated).
+
+### 5.4 Arrow serialization (TC-ARROW) — *26% of historical bugs*
+
+---
+
+#### TC-ARROW-01 — `save_to_disk` / `load_from_disk` round-trip `[SMOKE]`
+- **Priority:** P0 · **Tags:** arrow, critical
+- **Steps:**
+  1. `ds.save_to_disk(path)`, then `load_from_disk(path)`, then compare the two datasets.
+- **Expected:** Loaded dataset equals original (schema, counts, values).
+
+#### TC-ARROW-02 — Round-trip a `DatasetDict`
+- **Priority:** P1 · **Tags:** arrow
+- **Steps:**
+  1. Save/load a multi-split `DatasetDict`.
+- **Expected:** All splits restored identically.
+
+#### TC-ARROW-03 — Parquet load preserves schema `[SMOKE]`
+- **Priority:** P0 · **Tags:** arrow, data-loading
+- **Steps:**
+  1. Load a Parquet fixture with nested + nullable columns.
+- **Expected:** Schema/types preserved; nulls intact.
+
+#### TC-ARROW-04 — `cast_column` to `ClassLabel`
+- **Priority:** P1 · **Tags:** arrow
+- **Steps:**
+  1. `ds.cast_column("label", ClassLabel(names=[...]))`.
+- **Expected:** Integer-encoded labels; `.features` reflect `ClassLabel`.
+
+#### TC-ARROW-05 — `cast` whole-dataset schema change
+- **Priority:** P1 · **Tags:** arrow
+- **Steps:**
+  1. `ds.cast(Features({...}))` with a compatible new schema.
+- **Expected:** Values cast correctly; incompatible casts raise a clear error.
+
+#### TC-ARROW-06 — `to_pandas` / `to_dict` round-trip `[SMOKE]`
+- **Priority:** P0 · **Tags:** arrow
+- **Steps:**
+  1. `df = ds.to_pandas()`, `d = ds.to_dict()`; rebuild via `from_pandas`/`from_dict`.
+- **Expected:** Values preserved both directions.
+
+#### TC-ARROW-07 — `map` with missing/None values (regression of #2831)
+- **Priority:** P1 · **Tags:** arrow, map, bug, regression
+- **Steps:**
+  1. `.map` over a column containing `None`/missing values.
+- **Expected:** No `ArrowInvalid`; nulls handled (guards #2831).
+
+#### TC-ARROW-08 — Added column length must match (regression of #2768)
+- **Priority:** P1 · **Tags:** arrow, map, bug, regression
+- **Steps:**
+  1. In `.map(batched=True)`, return a column whose length matches the batch; then repeat with a deliberately mismatched length.
+- **Expected:** Succeeds; a deliberate length-mismatch raises the documented clear error
+  (guards #2768 "Added column's length must match table's length").
+
+#### TC-ARROW-09 — JSON → Arrow nested type (regression of #2799)
+- **Priority:** P2 · **Tags:** arrow, data-loading, bug
+- **Steps:**
+  1. Load JSON containing deeply nested/heterogeneous structures.
+- **Expected:** Loads without `ArrowNotImplementedError`, or fails with a clear, documented message (guards #2799).
+
+#### TC-ARROW-10 — Int overflow / large values
+- **Priority:** P2 · **Tags:** arrow
+- **Steps:**
+  1. Build a dataset with very large ints exceeding int32.
+- **Expected:** Promoted to int64 / handled without silent corruption.
+
+#### TC-ARROW-11 — `Sequence` feature with None entries (regression of #2892)
+- **Priority:** P2 · **Tags:** arrow, bug
+- **Steps:**
+  1. Encode a `Sequence(...)` feature where some rows are `None`.
+- **Expected:** Encodes without error (guards #2892).
+
+#### TC-ARROW-12 — `to_json` export correctness (regression of #7037)
+- **Priority:** P1 · **Tags:** arrow, bug, regression
+- **Steps:**
+  1. `ds.to_json(path)`; reload via JSON builder; compare.
+- **Expected:** Export is valid and round-trips (guards #7037 `to_json` bug).
+
+#### TC-ARROW-13 — `to_csv` / `to_parquet` exports
+- **Priority:** P2 · **Tags:** arrow
+- **Steps:**
+  1. `ds.to_csv(...)` and `ds.to_parquet(...)`; reload.
+- **Expected:** Round-trip values preserved (allowing for CSV stringification).
+
+#### TC-ARROW-14 — Memory-mapped read of large arrow table
+- **Priority:** P2 · **Tags:** arrow, flaky
+- **Steps:**
+  1. `load_from_disk` a large dataset; access random rows.
+- **Expected:** Backed by memory-map; RAM stays bounded; random access works.
+
+#### TC-ARROW-15 — `concatenate_datasets` schema compatibility
+- **Priority:** P1 · **Tags:** arrow
+- **Steps:**
+  1. Concatenate two datasets with identical schemas; then with incompatible ones.
+- **Expected:** Compatible concat works; incompatible raises a clear schema error.
+
+#### TC-ARROW-16 — `interleave_datasets`
+- **Priority:** P2 · **Tags:** arrow, streaming
+- **Steps:**
+  1. Interleave two datasets with given probabilities.
+- **Expected:** Mixed output respects ratios; works for both eager and streaming.
+
+### 5.5 Splits & slicing (TC-SPLIT) — *29% of historical bugs; 332 open today*
+
+---
+
+#### TC-SPLIT-01 — `train_test_split` non-empty `[SMOKE]`
+- **Priority:** P0 · **Tags:** split, critical, regression
+- **Steps:**
+  1. `ds.train_test_split(test_size=0.2)`.
+- **Expected:** Both splits non-empty; sizes ≈ ratio (guards #676 empty-split regression).
+
+#### TC-SPLIT-02 — `train_test_split` reproducible with seed
+- **Priority:** P1 · **Tags:** split, flaky
+- **Steps:**
+  1. Run twice with same `seed`; compare indices.
+- **Expected:** Identical partition across runs.
+
+#### TC-SPLIT-03 — `train_test_split` on a `Dataset` (not DatasetDict) (regression of #1600)
+- **Priority:** P1 · **Tags:** split, bug, regression
+- **Steps:**
+  1. Call `train_test_split` on a single `Dataset`.
+- **Expected:** Works (the method exists on `Dataset`); guards #1600 AttributeError class.
+
+#### TC-SPLIT-04 — Percentage slicing `split="train[:10%]"` `[SMOKE]`
+- **Priority:** P0 · **Tags:** split
+- **Steps:**
+  1. `load_dataset(name, split="train[:10%]")`.
+- **Expected:** Returns ~10% of train rows.
+
+#### TC-SPLIT-05 — Absolute index slicing `split="train[:100]"`
+- **Priority:** P1 · **Tags:** split
+- **Steps:**
+  1. `load_dataset(name, split="train[:100]")`.
+- **Expected:** Exactly 100 rows (or fewer if dataset smaller).
+
+#### TC-SPLIT-06 — Combined / additive split expression
+- **Priority:** P2 · **Tags:** split
+- **Steps:**
+  1. `split="train+test"` and `split="train[:50%]+test[50%:]"`.
+- **Expected:** Concatenated per the expression; counts add up.
+
+#### TC-SPLIT-07 — `select()` returns rows & supports `.shape` `[SMOKE]`
+- **Priority:** P0 · **Tags:** split, regression
+- **Steps:**
+  1. `sub = ds.select(range(10))`; read `sub.shape`, `sub[0]`.
+- **Expected:** 10 rows; `.shape` returns `(10, n_cols)` (guards #1622 select().shape).
+
+#### TC-SPLIT-08 — `select()` with out-of-range index
+- **Priority:** P2 · **Tags:** split
+- **Steps:**
+  1. `ds.select([0, len(ds)+5])`.
+- **Expected:** Raises a clear `IndexError`-style error; no silent wrap.
+
+#### TC-SPLIT-09 — `shard()` partitions correctly
+- **Priority:** P1 · **Tags:** split
+- **Steps:**
+  1. `ds.shard(num_shards=4, index=i)` for i in 0..3; concatenate.
+- **Expected:** Shards are disjoint and cover the full dataset exactly once.
+
+#### TC-SPLIT-10 — Integer / slice / list indexing
+- **Priority:** P1 · **Tags:** split
+- **Steps:**
+  1. `ds[0]`, `ds[:5]`, `ds[[1,3,5]]`.
+- **Expected:** A single row dict for `ds[0]`; a column-wise dict-of-lists for both the slice and the index list.
+
+#### TC-SPLIT-11 — Negative indexing
+- **Priority:** P2 · **Tags:** split
+- **Steps:**
+  1. `ds[-1]`.
+- **Expected:** Returns last row (or documented error if unsupported) — behaviour is consistent.
+
+#### TC-SPLIT-12 — `sort()` then slice
+- **Priority:** P2 · **Tags:** split
+- **Steps:**
+  1. `ds.sort("col")[:5]`.
+- **Expected:** Rows sorted ascending; slice reflects new order.
+
+#### TC-SPLIT-13 — `shuffle(seed)` reproducibility
+- **Priority:** P1 · **Tags:** split, flaky
+- **Steps:**
+  1. `ds.shuffle(seed=42)` twice; compare order.
+- **Expected:** Deterministic identical order across runs.
+
+#### TC-SPLIT-14 — Named-split metadata preserved
+- **Priority:** P2 · **Tags:** split
+- **Steps:**
+  1. After slicing, inspect `ds.split` attribute.
+- **Expected:** Split name/metadata reflects the slice expression.
+
+#### TC-SPLIT-15 — `NonMatchingSplitsSizesError` surfaced on corrupt split (regression of #2941)
+- **Priority:** P1 · **Tags:** split, data-loading, bug, regression
+- **Preconditions:** Fixture with deliberately wrong recorded split size.
+- **Steps:**
+  1. Load with default verification.
+- **Expected:** Raises `NonMatchingSplitsSizesError` (detection works; guards #2941).
+
+#### TC-SPLIT-16 — Parquet shard export keeps all shards (regression of #8269)
+- **Priority:** P1 · **Tags:** split, arrow, bug, regression
+- **Steps:**
+  1. Export a multi-shard config to parquet; reload.
+- **Expected:** No shards dropped; full row count preserved (guards #8269).
+
+### 5.6 Map & filter transformations (TC-MAP) — *237 open today*
+
+---
+
+#### TC-MAP-01 — `map()` adds a column (single proc) `[SMOKE]`
+- **Priority:** P0 · **Tags:** map, critical
+- **Steps:**
+  1. `ds.map(lambda x: {"len": len(x["text"])})`.
+- **Expected:** New `len` column present with correct values; original columns intact.
+
+#### TC-MAP-02 — `map()` modifies existing column
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. `ds.map(lambda x: {"text": x["text"].lower()})`.
+- **Expected:** Column values updated in the returned dataset (`map` does not mutate `ds`); row count unchanged.
+
+#### TC-MAP-03 — `map(remove_columns=...)`
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. `ds.map(fn, remove_columns=["text"])`.
+- **Expected:** Listed columns removed; new columns retained.
+
+#### TC-MAP-04 — `filter()` correct subset `[SMOKE]`
+- **Priority:** P0 · **Tags:** map, critical
+- **Steps:**
+  1. `ds.filter(lambda x: x["label"] == 1)`.
+- **Expected:** Only matching rows remain; count correct.
+
+#### TC-MAP-05 — `filter(fn_kwargs=...)` (regression of #2927/#2950)
+- **Priority:** P1 · **Tags:** map, bug, regression
+- **Steps:**
+  1. `ds.filter(pred, fn_kwargs={"threshold": 5})`.
+- **Expected:** kwargs passed through correctly (guards #2927, fixed in PR #2950).
+
+#### TC-MAP-06 — `map(fn_kwargs=...)`
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. `ds.map(fn, fn_kwargs={"suffix": "!"})`.
+- **Expected:** Extra kwargs reach the function.
+
+#### TC-MAP-07 — `map(batched=True)` `[SMOKE]`
+- **Priority:** P0 · **Tags:** map, critical
+- **Steps:**
+  1. `ds.map(batch_fn, batched=True, batch_size=100)`.
+- **Expected:** Batched processing; output equals non-batched equivalent.
+
+#### TC-MAP-08 — `map(batched=True)` that changes row count
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. Batched fn that splits each row into multiple (data augmentation pattern).
+- **Expected:** Output row count grows accordingly; no length-mismatch error.
+
+#### TC-MAP-09 — `map(with_indices=True)`
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. `ds.map(lambda x, i: {"idx": i}, with_indices=True)`.
+- **Expected:** Indices match position 0..n-1.
+
+#### TC-MAP-10 — `map(with_rank=True)` under multiprocessing
+- **Priority:** P2 · **Tags:** map, multiprocessing
+- **Steps:**
+  1. `ds.map(fn, with_rank=True, num_proc=2)`.
+- **Expected:** Rank passed; covers full range of worker ranks.
+
+#### TC-MAP-11 — `set_transform` / `with_transform` (on-the-fly)
+- **Priority:** P1 · **Tags:** map
+- **Steps:**
+  1. `ds.with_transform(transform)`; read rows.
+- **Expected:** Transform applied lazily at access time; underlying Arrow data unchanged.
+
+#### TC-MAP-12 — `map` keeps features when fn returns subset
+- **Priority:** P2 · **Tags:** map, arrow
+- **Steps:**
+  1. fn returns a dict containing only some of the column names (or only new keys).
+- **Expected:** Returned keys are merged into each row; columns not returned by fn are kept unchanged.
+
+#### TC-MAP-13 — `filter` empty result
+- **Priority:** P2 · **Tags:** map
+- **Steps:**
+  1. Predicate matching nothing.
+- **Expected:** Returns a valid 0-row dataset; no crash.
+
+#### TC-MAP-14 — `map` error inside fn surfaces clearly
+- **Priority:** P2 · **Tags:** map
+- **Steps:**
+  1. fn raises a Python exception on one row.
+- **Expected:** Error propagates with a traceback identifying the failure (not swallowed).
+
+#### TC-MAP-15 — Dataset vs IterableDataset `map` parity (regression of #5870)
+- **Priority:** P1 · **Tags:** map, streaming, bug, regression
+- **Steps:**
+  1. Apply identical `.map` eagerly and on a streamed version; compare outputs.
+- **Expected:** Same results & semantics across both (guards #5870 behaviour divergence).
+
+#### TC-MAP-16 — `map` caching reuse correctness
+- **Priority:** P1 · **Tags:** map, cache
+- **Steps:**
+  1. Run `.map(fn)` twice in the same session.
+- **Expected:** Second run uses cache; output identical.
+
+### 5.7 Multiprocessing & parallelism (TC-MP) — *flaky-prone; #6936/#6357 open*
+
+---
+
+#### TC-MP-01 — `map(num_proc=4)` completes & matches single-proc `[SMOKE]`
+- **Priority:** P0 · **Tags:** multiprocessing, critical, flaky
+- **Steps:**
+  1. Run `.map(fn)` with `num_proc=1` and `num_proc=4`; compare outputs.
+- **Expected:** Identical results; multi-proc run completes without hang.
+
+#### TC-MP-02 — `num_proc` > dataset length (regression of #2470)
+- **Priority:** P1 · **Tags:** multiprocessing, bug, regression
+- **Steps:**
+  1. `.map(fn, num_proc=16)` on a 4-row dataset.
+- **Expected:** No crash; worker count is reduced to at most `len(ds)` (guards #2470).
+
+#### TC-MP-03 — `num_proc` memory usage bounded (regression of #2256)
+- **Priority:** P2 · **Tags:** multiprocessing, flaky, bug
+- **Steps:**
+  1. `.map(fn, num_proc=8)` on a moderately large dataset; monitor RSS.
+- **Expected:** Memory stays within expected bounds; no per-worker full-copy blow-up (#2256).
+
+#### TC-MP-04 — Multiprocessed `map` preserves order
+- **Priority:** P1 · **Tags:** multiprocessing
+- **Steps:**
+  1. Run `.map(fn, with_indices=True)` with `num_proc>1`; compare the emitted indices with row positions.
+- **Expected:** Output row order matches input order after recombination.
+
+#### TC-MP-05 — `filter(num_proc>1)` no hang `[SMOKE]`
+- **Priority:** P0 · **Tags:** multiprocessing, critical, flaky
+- **Steps:**
+  1. `.filter(pred, num_proc=4)`.
+- **Expected:** Completes within timeout; correct subset (guards #2600 filter mp crash).
+
+#### TC-MP-06 — `map(num_proc>1)` with caching
+- **Priority:** P1 · **Tags:** multiprocessing, cache
+- **Steps:**
+  1. Run multiproc `.map` twice.
+- **Expected:** Per-shard cache files reused on the second run.
+
+#### TC-MP-07 — Custom multiprocessing context (relates to #6357)
+- **Priority:** P2 · **Tags:** multiprocessing
+- **Steps:**
+  1. If supported, pass a `multiprocessing` context (e.g. spawn) to `.map`.
+- **Expected:** Honoured without deadlock (tracks feature request #6357).
+
+#### TC-MP-08 — `save_to_disk(num_proc>1)` no freeze (regression of #6936)
+- **Priority:** P1 · **Tags:** multiprocessing, arrow, bug, regression
+- **Steps:**
+  1. `ds.save_to_disk(path, num_proc=4)` to local disk.
+- **Expected:** Completes without freezing (guards #6936 class; S3 variant noted as env-dependent).
+
+#### TC-MP-09 — `num_proc=1` equals default behaviour
+- **Priority:** P2 · **Tags:** multiprocessing
+- **Steps:**
+  1. Compare `num_proc=1` vs default (None).
+- **Expected:** Identical results; no spurious process spawning.
+
+#### TC-MP-10 — Exception in worker propagates (no silent hang)
+- **Priority:** P1 · **Tags:** multiprocessing, flaky
+- **Steps:**
+  1. fn raises in one worker during `num_proc>1` map.
+- **Expected:** Error surfaces to main process promptly; pool torn down; no hang.
+
+#### TC-MP-11 — Keyboard-interrupt cleans up workers
+- **Priority:** P2 · **Tags:** multiprocessing, flaky
+- **Steps:**
+  1. Start a long multiproc `.map`; send SIGINT.
+- **Expected:** All child processes terminated; no orphan processes/locks left.
+
+#### TC-MP-12 — Parallel `load_dataset` from multiple processes
+- **Priority:** P2 · **Tags:** multiprocessing, cache, flaky
+- **Steps:**
+  1. Launch N processes loading the same uncached dataset simultaneously.
+- **Expected:** Lock file coordinates them; exactly one prepares cache; no corruption.
+
+#### TC-MP-13 — `map(num_proc>1)` with `with_rank`
+- **Priority:** P2 · **Tags:** multiprocessing
+- **Steps:**
+  1. fn uses `rank` to write per-rank artefacts.
+- **Expected:** Rank ids are distinct and form the contiguous range 0..num_proc-1.
+
+#### TC-MP-14 — Deterministic output across `num_proc` values
+- **Priority:** P1 · **Tags:** multiprocessing, flaky
+- **Steps:**
+  1. Run identical `.map` at num_proc 1, 2, 4, 8.
+- **Expected:** Output dataset identical regardless of worker count.
+
+### 5.8 trust_remote_code handling (TC-TRC) — *coverage gap; 36 open today*
+
+---
+
+#### TC-TRC-01 — Script dataset blocked without flag `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading, critical
+- **Preconditions:** A dataset that requires a remote/local loading script.
+- **Steps:**
+  1. `load_dataset(script_dataset)` WITHOUT `trust_remote_code`.
+- **Expected:** Raises/prompts requiring explicit `trust_remote_code=True`; arbitrary code is
+  NOT executed by default (security-critical).
+
+#### TC-TRC-02 — Script dataset loads with flag `[SMOKE]`
+- **Priority:** P0 · **Tags:** data-loading, critical
+- **Steps:**
+  1. `load_dataset(script_dataset, trust_remote_code=True)`.
+- **Expected:** Loads successfully; documented behaviour.
+
+#### TC-TRC-03 — `trust_remote_code=False` explicit
+- **Priority:** P1 · **Tags:** data-loading
+- **Steps:**
+  1. `load_dataset(script_dataset, trust_remote_code=False)`.
+- **Expected:** Explicitly refuses to run the script; clear error.
+
+#### TC-TRC-04 — Pure-data dataset ignores the flag
+- **Priority:** P1 · **Tags:** data-loading
+- **Steps:**
+  1. Load a parquet/csv-only dataset with and without the flag.
+- **Expected:** Identical results; no script => flag is a no-op (no spurious prompt).
+
+#### TC-TRC-05 — Flag still accepted / not removed (regression of #7723)
+- **Priority:** P1 · **Tags:** data-loading, regression
+- **Steps:**
+  1. Call `load_dataset(..., trust_remote_code=True)` on the version under test.
+- **Expected:** Argument is accepted (not removed/deprecated-to-error) — guards #7723.
+
+#### TC-TRC-06 — Streaming + trust_remote_code (regression of #7692)
+- **Priority:** P1 · **Tags:** streaming, data-loading, regression
+- **Steps:**
+  1. `load_dataset(script, streaming=True, trust_remote_code=True)`; iterate.
+- **Expected:** Streams correctly without decode error (guards #7692). (Cross-reference: duplicates TC-STR-17.)
+
+#### TC-TRC-07 — Loading-script dataset with config + remote code
+- **Priority:** P2 · **Tags:** data-loading
+- **Steps:**
+  1. Load a script dataset specifying a config name with the flag set.
+- **Expected:** Correct config loads; code executed only because flag was set.
+
+#### TC-TRC-08 — Cached script dataset reuse honours flag
+- **Priority:** P2 · **Tags:** data-loading, cache
+- **Steps:**
+  1. Load script dataset with flag (cache built); reload.
+- **Expected:** Second load is served from cache without unexpectedly re-executing the script; results are consistent.
+
+#### TC-TRC-09 — Clear, actionable error message text
+- **Priority:** P2 · **Tags:** data-loading
+- **Steps:**
+  1. Trigger the block; read the error/prompt message.
+- **Expected:** Message names the dataset and tells the user exactly how to opt in.
+
+#### TC-TRC-10 — Env/global default behaviour
+- **Priority:** P2 · **Tags:** data-loading
+- **Steps:**
+  1. Check whether any global setting can pre-authorize; verify documented default is "deny".
+- **Expected:** Secure-by-default; any opt-in mechanism behaves as documented.
+
+### 5.9 Cross-cutting / parity (TC-XCUT)
+
+---
+
+#### TC-XCUT-01 — Same dataset loads normal AND streaming `[SMOKE]`
+- **Priority:** P0 · **Tags:** streaming, data-loading, critical
+- **Steps:**
+  1. Load a dataset both ways; compare schema + first rows.
+- **Expected:** Both succeed; schemas match (guards #2923/#2866 "works in one mode only").
+
+#### TC-XCUT-02 — Features schema identical eager vs streaming
+- **Priority:** P1 · **Tags:** streaming, arrow
+- **Steps:**
+  1. Compare `.features` from eager and streamed loads.
+- **Expected:** Equal feature definitions.
+
+#### TC-XCUT-03 — `map` output parity eager vs streaming
+- **Priority:** P1 · **Tags:** map, streaming, regression
+- **Steps:**
+  1. Same transform applied both ways; compare first N rows. (See TC-MAP-15.)
+- **Expected:** Identical transformed rows.
+
+#### TC-XCUT-04 — Error parity across modes
+- **Priority:** P2 · **Tags:** streaming, data-loading
+- **Steps:**
+  1. Trigger the same fault (e.g. bad column) in eager and streaming.
+- **Expected:** Comparable, clear errors in both modes (no "silent in streaming" divergence).
+
+#### TC-XCUT-05 — Version round-trip compatibility
+- **Priority:** P2 · **Tags:** arrow, cache, regression
+- **Steps:**
+  1. `save_to_disk` with the version under test; `load_from_disk` it back.
+- **Expected:** Self-consistent round-trip; documented behaviour for older on-disk formats.
+
+#### TC-XCUT-06 — `with_format` consistency (torch/numpy/pandas)
+- **Priority:** P1 · **Tags:** arrow
+- **Steps:**
+  1. `ds.with_format(fmt)` for `fmt` in `"numpy"`, `"torch"`, `"pandas"`; read a row from each.
+- **Expected:** Correct container types per format; values numerically equal across formats.
+
+---
+
+## 6. Traceability matrix (category → cases → evidence)
+
+| Focus area | Test cases | Historical issues anchored | Live (2026) issues anchored |
+|---|---|---|---|
+| Data loading | TC-DL-01..18 (18) | #2882, #2937, #2941, #2918 | #6829, #6937 |
+| Streaming | TC-STR-01..18 (18) | #2923, #2866, #2944, #2918 | #7692 |
+| Caching | TC-CACHE-01..14 (14) | #2496, #2904, #2775, #2591, #2937, #2471 | #4526 |
+| Arrow serialization | TC-ARROW-01..16 (16) | #2831, #2768, #2799, #2892, #2591 | #7037 |
+| Splits & slicing | TC-SPLIT-01..16 (16) | #676, #1600, #1622, #2941 | #8269, #4526 |
+| Map & filter | TC-MAP-01..16 (16) | #2927, #2950 | #5870, #6787, #6789 |
+| Multiprocessing | TC-MP-01..14 (14) | #2470, #2256, #2600 | #6936, #6357 |
+| trust_remote_code | TC-TRC-01..10 (10) | (n/a — feature post-dates snapshot) | #7723, #7692, #7531 |
+| Cross-cutting | TC-XCUT-01..06 (6) | #2923, #2866 | #5870 |
+| **Total** | **128 cases** | | |
+
+---
+
+## 7. Data sources & method notes
+
+- **Historical analysis:** `github-bug-reports-and-issues.jsonl` — 984 issues (PRs excluded),
+  `huggingface/datasets`, snapshot 2020-04 → 2021-09. Categorized by keyword/label matching
+  against the `qa_brief.yaml` focus areas (issues may match multiple categories).
+- **Live cross-check:** GitHub Search API, June 2026, `repo:huggingface/datasets is:issue
+  is:open` queries per focus area. Current issue numbers reach #8269.
+- **PR-structure context:** The supplied `open-source-pull-requests-with-documentation-changes.json`
+  turned out to be a transformers *model-class → SHA* mapping, **not** PR templates. PR
+  conventions were therefore derived from the 2035 real PRs inside the `.jsonl` snapshot
+  (concise descriptive title + explanatory body + `Fix #NNNN` issue references).
+- **Known limitation:** Live exact open-counts for `data loading`, `caching`, `arrow`, and
+  `streaming` were not captured cleanly because some issue bodies contained control characters
+  that broke strict JSON parsing during collection. These remain high-priority on volume +
+  core-surface grounds; the gap is in the *count*, not the prioritization.