Added strict_priority flag to ts_merge to tile series without interior nan filling.

Eli · Eli · commit cf6bf64a9736 · 2025-08-28T15:19:32.000-07:00
diff --git a/docsrc/notebooks/merge_splice.ipynb b/docsrc/notebooks/merge_splice.ipynb
@@ -8,7 +8,7 @@
     "# Merging and Splicing Time Series\n",
     "This tutorial demonstrates the usage and difference between `ts_merge` and `ts_splice`, two methods for folding together time series into a combined data structure.\n",
     "\n",
-    "- **`ts_merge`** blends multiple time series together based on priority, filling missing values. It potentiallyu uses all the input series at all timestamps.\n",
+    "- **`ts_merge`** blends multiple time series together based on priority, optionally filling missing values in higher-priority series with entries from lower-priority series. It potentially uses all the input series at all timestamps. See the [`strict_priority`](#ts_merge-strict-priority-option) option below for advanced control over NaN-filling between priorities.\n",
     "- **`ts_splice`** stitches together time series in sequential time **blocks** without mixing values.\n",
     "\n",
     "We will describe the effect on regularly sampled series (which have the  `freq` attribute) and on irregular. We will also  explore the **`names`** argument, which controls how columns are selected or renamed in the merging/splicing process. There is a file-level command line tools for this as well in the `dms_datastore` package.\n",
@@ -1012,6 +1012,95 @@
     "\n",
     "This notebook provides a clear comparison to help you decide which method best suits your use case.\n"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# `ts_merge`: strict priority option\n",
+    "**New option**: `strict_priority` (default `False`) enforces that a higher‑priority series dominates between its `first_valid_index` and `last_valid_index`.\n",
+    "\n",
+    "**Semantics**\n",
+    "- Per **column**, define the dominance window as `[first_valid_index, last_valid_index]`.\n",
+    "- Within that window, lower‑priority series are **masked**, even if the higher‑priority value is `NaN`.\n",
+    "- Outside those windows, merging is unchanged and lower‑priority series may contribute.\n",
+    "- With irregular inputs, timestamps that exist **only** in lower‑priority series **and** are fully masked inside a dominance window are dropped; timestamps from the top series' index are preserved even if all‑`NaN`.\n",
+    "\n",
+    "**`names` behavior** is unchanged.\n",
+    "### Example 1 — Series with interior `NaN`\n",
+    "\n",
+    "```python\n",
+    "import numpy as np, pandas as pd\n",
+    "from vtools.functions.merge import ts_merge\n",
+    "\n",
+    "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
+    "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
+    "s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n",
+    "s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n",
+    "\n",
+    "ts_merge((s1, s2))                      # default\n",
+    "ts_merge((s1, s2), strict_priority=True)\n",
+    "```\n",
+    "### Example 2 — Two columns, per‑column dominance\n",
+    "\n",
+    "```python\n",
+    "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
+    "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
+    "df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n",
+    "df1[\"B\"] = df1[\"A\"]\n",
+    "df1.loc[idx1[2], \"B\"] = np.nan  # interior NaN in high‑priority B\n",
+    "df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n",
+    "df2[\"B\"] = df2[\"A\"]\n",
+    "\n",
+    "ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]]\n",
+    "```\n",
+    "### Example 3 — Irregular inputs\n",
+    "\n",
+    "```python\n",
+    "idx1 = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n",
+    "idx2 = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n",
+    "s1 = pd.Series([1.,2.,3.,4.], index=idx1, name=\"A\")\n",
+    "s2 = pd.Series([10.,20.,30.,40.], index=idx2, name=\"A\")\n",
+    "\n",
+    "ts_merge((s1, s2), strict_priority=True)\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np, pandas as pd\n",
+    "from vtools.functions.merge import ts_merge\n",
+    "\n",
+    "# Example 1\n",
+    "idx1 = pd.date_range(\"2023-01-01\", periods=5, freq=\"D\")\n",
+    "idx2 = pd.date_range(\"2023-01-03\", periods=5, freq=\"D\")\n",
+    "s1 = pd.Series([1, 2, np.nan, 4, 5], index=idx1, name=\"A\")\n",
+    "s2 = pd.Series([10, 20, 30, np.nan, 50], index=idx2, name=\"A\")\n",
+    "print(\"Example 1 strict=False:\")\n",
+    "print(ts_merge((s1, s2)))\n",
+    "print(\"Example 1 strict=True:\")\n",
+    "print(ts_merge((s1, s2), strict_priority=True))\n",
+    "\n",
+    "# Example 2\n",
+    "df1 = pd.DataFrame({\"A\":[1., np.nan, 3., 4., 5.]}, index=idx1)\n",
+    "df1[\"B\"] = df1[\"A\"]; df1.loc[idx1[2], \"B\"] = np.nan\n",
+    "df2 = pd.DataFrame({\"A\":[10., 20., np.nan, 40., 50.]}, index=idx2)\n",
+    "df2[\"B\"] = df2[\"A\"]\n",
+    "print(\"\\nExample 2 strict=True:\")\n",
+    "print(ts_merge((df1, df2), strict_priority=True)[[\"A\",\"B\"]])\n",
+    "\n",
+    "# Example 3\n",
+    "idx1i = pd.to_datetime([\"2023-01-01\",\"2023-01-03\",\"2023-01-07\",\"2023-01-10\"])\n",
+    "idx2i = pd.to_datetime([\"2023-01-02\",\"2023-01-04\",\"2023-01-08\",\"2023-01-11\"])\n",
+    "s1i = pd.Series([1.,2.,3.,4.], index=idx1i, name=\"A\")\n",
+    "s2i = pd.Series([10.,20.,30.,40.], index=idx2i, name=\"A\")\n",
+    "print(\"\\nExample 3 strict=True:\")\n",
+    "print(ts_merge((s1i, s2i), strict_priority=True))\n"
+   ]
   }
  ],
  "metadata": {
diff --git a/tests/test_merge_splice.py b/tests/test_merge_splice.py
@@ -301,3 +301,42 @@ def test_non_datetime_index(self):
         df2 = pd.DataFrame({"A": [4, 5, 6]}, index=[2, 3, 4])
         with pytest.raises(ValueError, match="All input series must have a DatetimeIndex."):
             ts_merge((df1, df2))
+
+
+# ----------------------------------------------------------------------
+# Additional tests for strict_priority behavior in ts_merge
+# ----------------------------------------------------------------------
+
+def test_ts_merge_strict_priority_series_window(sample_data):
+    s1, s2 = sample_data["series1"], sample_data["series2"]
+    # s1 dominates Jan1..Jan5; its NaN on Jan3 remains NaN; s2 cannot fill it.
+    result = ts_merge((s1, s2), strict_priority=True)
+    expected_index = s1.index.union(s2.index, sort=False).sort_values()
+    expected = pd.Series([1., 2., np.nan, 4., 5., np.nan, 50.], index=expected_index, name="A")
+    pd.testing.assert_series_equal(result, expected)
+
+def test_ts_merge_strict_priority_dataframe_per_column(sample_data):
+    df1, df2 = sample_data["df1"], sample_data["df2"]
+    # Create multi-column frames to exercise per-column dominance
+    df1m = pd.concat([df1, df1.rename(columns={"A": "B"})], axis=1)
+    df2m = pd.concat([df2, df2.rename(columns={"A": "B"})], axis=1)
+    # Insert an interior NaN in higher-priority B column to ensure NaN is not backfilled
+    df1m.loc[df1m.index[2], "B"] = np.nan
+    result = ts_merge((df1m, df2m), strict_priority=True)
+    expected_index = df1m.index.union(df2m.index, sort=False).sort_values()
+    exp = pd.DataFrame(index=expected_index, columns=["A", "B"], dtype=float)
+    # Column A: df1 dominates its whole window, so its NaN at the second timestamp stays NaN; df2 contributes only after the window
+    exp["A"] = [1., np.nan, 3., 4., 5., 40., 50.]
+    # Column B: an interior NaN in df1's window must remain NaN
+    exp["B"] = [1., np.nan, np.nan, 4., 5., 40., 50.]
+    pd.testing.assert_frame_equal(result[["A", "B"]], exp)
+
+def test_ts_merge_strict_priority_irregular(irregular_sample_data):
+    s1 = irregular_sample_data["series1"]
+    s2 = irregular_sample_data["series2"]
+    # Within s1's window [first_valid_index, last_valid_index], s2 is masked; s2 contributes only after it.
+    result = ts_merge((s1, s2), strict_priority=True)
+    expected = pd.Series([1., 2., 3., 4., 40.],
+                         index=pd.to_datetime(["2023-01-01","2023-01-03","2023-01-07","2023-01-10","2023-01-11"]),
+                         name="A")
+    pd.testing.assert_series_equal(result, expected)
diff --git a/vtools/functions/merge.py b/vtools/functions/merge.py