lbl-camera
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 39 additions & 2 deletions
diff --git a/examples/gp2ScaleTest.ipynb b/examples/gp2ScaleTest.ipynb
Lines changed: 16 additions & 2 deletions
diff --git a/fvgp/fvgp.py b/fvgp/fvgp.py
Lines changed: 2 additions & 2 deletions
diff --git a/fvgp/gp.py b/fvgp/gp.py
Lines changed: 37 additions & 7 deletions
diff --git a/fvgp/gp_data.py b/fvgp/gp_data.py
Lines changed: 27 additions & 10 deletions
@@ -35,14 +35,32 @@ Both classes are composed of internal specialist objects created at `__init__` t
 
 | Class | File | Responsibility |
 |---|---|---|
-| `GPdata` | [gp_data.py](fvgp/gp_data.py) | Data validation, shape tracking, Euclidean vs. non-Euclidean |
-| `GPprior` | [gp_prior.py](fvgp/gp_prior.py) | Kernel and mean function; default is anisotropic Matérn with ARD |
+| `GPdata` | [gp_data.py](fvgp/gp_data.py) | Data validation, shape tracking, Euclidean vs. non-Euclidean. Sole source of truth for `x_data`, `y_data`, `noise_variances`, plus the pre-append snapshot (`x_old`, `y_old`, `noise_variances_old`) and last-appended chunk (`x_new`, `y_new`, `noise_variances_new`) |
+| `GPprior` | [gp_prior.py](fvgp/gp_prior.py) | Kernel and mean function (default: anisotropic Matérn with ARD). In gp2Scale mode also owns `x_data_scatter_future` (the persistent dask scatter of `x_data`) |
 | `GPlikelihood` | [gp_likelihood.py](fvgp/gp_likelihood.py) | Noise model (variances or callable) |
 | `GPkv` | [gp_kv.py](fvgp/gp_kv.py) | Owns K+V matrix state and all factorizations; dispatches solves/logdets across linalg modes |
 | `GPMarginalLikelihood` | [gp_marginal_likelihood.py](fvgp/gp_marginal_likelihood.py) | Log marginal likelihood and its gradient; delegates factorization to `GPkv` |
 | `GPposterior` | [gp_posterior.py](fvgp/gp_posterior.py) | Posterior mean/covariance; information-theoretic quantities |
 | `GPtraining` | [gp_training.py](fvgp/gp_training.py) | Hyperparameter optimization (scipy, hgdl async, MCMC, Adam) |
 
+### State propagation
+
+Sources of truth: `GPtraining.hyperparameters` and `GPdata.x_data` / `y_data` / `noise_variances`. Every other component reads these via `@property`. Each mutator below refreshes the listed cached state, in the order shown:
+
+| Mutator | What's refreshed |
+|---|---|
+| `GP.set_hyperparameters(hps)` | `trainer.hyperparameters` → `prior.update_state_hyperparameters()` (recomputes `m`, `K`) → `likelihood.update_state()` (`V`) → `kv.update_state_hyperparameters()` (factorization + `KVinvY`) |
+| `GP.update_gp_data(..., append=True)` | `data.update()` snapshots `x_old`/`y_old`/etc. → `prior.augment_state_data()` (rank-n update of `m`, `K`) → `likelihood.update_state()` → `kv.update_state_data(rank_n_update)` |
+| `GP.update_gp_data(..., append=False)` | `data.update()` clears `_old`/`_new` slots → `prior.update_state_data()` (full recompute) → `likelihood.update_state()` → `kv.update_state_data(rank_n_update)` |
+| `GP.train(...)` (sync) / `GP.update_hyperparameters(opt_obj)` (async) | both end with `set_hyperparameters(...)` |
+
+`GPposterior` and `GPMarginalLikelihood` hold **no cached state** — every read goes through properties, so they're automatically consistent.
+
+Gotchas:
+- **`GP.set_args(new_args)` does NOT invalidate `K`, `m`, `V`, or factorizations.** If `args` flows into a user kernel/mean/noise callable, new args take effect only on the next `set_hyperparameters`, `update_gp_data(append=False)`, fresh `train`, or posterior call with explicit `hyperparameters=`. To force a flush: `set_hyperparameters(self.hyperparameters)`.
+- **`update_gp_data(append=False, rank_n_update=True)`** is invalid (the previous factorization is for data that no longer exists); `GP.update_gp_data` emits a `UserWarning` and forces `rank_n_update=False`.
+- **`kv.solve(b, x0=...)`** zero-pads `x0` along axis 0 when shapes don't match, so a pre-append `KVinvY` can warm-start the post-append solve in iterative modes (sparseCG/MINRES/preconditioned variants). See [gp_kv.py:333-342](fvgp/gp_kv.py#L333-L342).
+
 ### Key supporting modules
 
 - **[gp_lin_alg.py](fvgp/gp_lin_alg.py)** — CPU/GPU linear algebra primitives; Cholesky, LU, sparse solvers; defines `NonPositiveDefiniteError`
@@ -55,6 +73,25 @@ Both classes are composed of internal specialist objects created at `__init__` t
 
 When `gp2Scale=True`, `GP` switches to a Wendland (compactly supported) kernel producing sparse covariance matrices and uses Dask for distributed computation. This path requires a Dask client to be passed in and uses sparse linear solvers instead of dense Cholesky.
 
+**Scatter ownership and lifecycle:**
+
+- `GPprior.x_data_scatter_future` is the single persistent dask scatter of the current `x_data`. Scattered once at `GPprior.__init__` (see [gp_prior.py:93-96](fvgp/gp_prior.py#L93-L96)).
+- `GPdata` does NOT scatter — it's pure-Python data only.
+- `_compute_prior_covariance_gp2Scale` reads `self.x_data_scatter_future` directly; **no scatter per call**, so training stays dask-quiet.
+- On data changes, `augment_state_data` / `update_state_data` refresh the scatter by **overwriting** `self.x_data_scatter_future` (no explicit `release()`). The old future loses its only Python ref and is cleaned up via `__del__`. Calling `release()` explicitly schedules a `_dec_ref` that races against subsequent scatter `replicate` operations in the scheduler — don't do it.
+- `_update_prior_covariance_gp2Scale` (the augment path) uses `self.x_data_scatter_future` for the `x_old` side (no content-hash collision since it shares the existing key) and scatters only `x_new` locally, releasing that local future at the end.
+
+**Cross-instance race guard:** [gp.py:14-21](fvgp/gp.py#L14-L21) defines `_GP_INSTANCES_PER_CLIENT`, a `WeakValueDictionary` keyed by `dask_client.id`. `GP.__init__` ([gp.py:285-303](fvgp/gp.py#L285-L303)) raises with a descriptive remediation message if you try to construct a second gp2Scale `GP` on a client that already has a live one — that pattern reliably triggers `FutureCancelledError`/`KeyError` from the scheduler. To reuse a client for a sequence of GPs:
+
+```python
+import gc
+del previous_gp
+gc.collect()
+client.run(lambda: None)  # flush pending releases
+```
+
+The `test_gp2Scale` test uses exactly this pattern between linalg-mode iterations.
+
 ### Customization API
 
 Kernels, mean functions, and noise models are all plain Python callables with standardized signatures. Users pass them as arguments to `GP`/`fvGP` constructors. The full hyperparameter vector is shared across kernel, mean, and noise callables, but each callable must only read its reserved index range. Kernel gradients can be user-supplied or computed via finite differences.
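
The shared-vector convention described above can be illustrated with a minimal sketch. The call signatures `(x1, x2, hps)` and `(x, hps)`, the index layout, and the squared-exponential kernel are assumptions chosen for illustration; the library's actual standardized signatures should be taken from its documentation.

```python
import numpy as np

# Hypothetical layout of the shared hyperparameter vector:
#   hps[0]    kernel signal variance
#   hps[1:3]  kernel length scales, one per input dimension
#   hps[3]    constant prior mean
#   hps[4]    noise variance

def kernel(x1, x2, hps):
    """Anisotropic squared-exponential kernel; reads hps[0:3] only."""
    diff = (x1[:, None, :] - x2[None, :, :]) / hps[1:3]
    return hps[0] * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def mean(x, hps):
    """Constant prior mean; reads hps[3] only."""
    return np.full(len(x), hps[3])

def noise(x, hps):
    """Homoscedastic noise variances; reads hps[4] only."""
    return np.full(len(x), hps[4])

hps = np.array([2.0, 0.5, 1.5, 0.1, 0.01])
x = np.random.default_rng(0).random((4, 2))
K = kernel(x, x, hps)   # (4, 4) prior covariance
m = mean(x, hps)        # (4,) prior mean
V = noise(x, hps)       # (4,) noise variances
```

Because every callable receives the full vector, nothing prevents two callables from reading the same index; keeping the ranges disjoint is a convention the user has to maintain.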
 
@@ -98,7 +98,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 8,
    "id": "fe1f017a",
    "metadata": {},
    "outputs": [
@@ -111,8 +111,22 @@
       "Finished  20  out of  100  iterations. f(x)=  -12722.285286212222\n",
       "Finished  30  out of  100  iterations. f(x)=  -12485.189964773206\n",
       "Finished  40  out of  100  iterations. f(x)=  -12473.181834409832\n",
-      "Finished  50  out of  100  iterations. f(x)=  -12473.181834409832\n"
+      "Finished  50  out of  100  iterations. f(x)=  -12473.181834409832\n",
+      "Finished  60  out of  100  iterations. f(x)=  -12466.485887381574\n",
+      "Finished  70  out of  100  iterations. f(x)=  -12460.7203909633\n",
+      "Finished  80  out of  100  iterations. f(x)=  -12460.7203909633\n",
+      "Finished  90  out of  100  iterations. f(x)=  -12460.7203909633\n"
      ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "array([0.14048159, 0.03980175])"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
     }
    ],
    "source": [
 
@@ -154,12 +154,12 @@ class fvGP(GP):
         If no kernel is provided, the ``compute_device`` option should be revisited.
         The default kernel will use the specified device to compute covariances.
         The default is False.
+    gp2Scale_batch_size : int, optional
+        Matrix batch size for distributed computing in gp2Scale. The default is 10000.
     dask_client : dask.distributed.Client, optional
         A dask client for gp2Scale, asynchronous training, and certain linear algebra operations.
         On HPC architecture, this client is provided by the job script. Please have a look at the examples.
         A local client is used as the default.
-    gp2Scale_batch_size : int, optional
-        Matrix batch size for distributed computing in gp2Scale. The default is 10000.
     linalg_mode : str, optional
         Controls the linear-algebra backend used to solve (K+V)x=b and compute log|K+V|.
         The default is ``None``, which selects ``"Chol"`` for standard GPs and automatically
 
@@ -1,4 +1,5 @@
 import warnings
+import weakref
 import numpy as np
 from loguru import logger
 from distributed import Client
@@ -13,6 +14,12 @@
 import importlib
 warnings.simplefilter("once", UserWarning)
 
+# Tracks live GP instances per dask client (gp2Scale mode only).  Used to detect
+# the case where a user creates a second GP on a client that still has a live GP,
+# which triggers race conditions between the new init scatter and the pending
+# `_dec_ref` callbacks from the previous GP's scatter activity.
+_GP_INSTANCES_PER_CLIENT = weakref.WeakValueDictionary()
+
 # TODO: also search below "TODO"
 # Appends and rank_n_updates for gp2Scale are not yet fully tested. Have to check the compute graph and test (what does rank_n_update even mean for the different modes? ). 
 
@@ -147,12 +154,12 @@ class GP:
         If no kernel is provided, the ``compute_device`` option should be revisited.
         The default kernel will use the specified device to compute covariances.
         The default is False.
+    gp2Scale_batch_size : int, optional
+        Matrix batch size for distributed computing in gp2Scale. The default is 10000.
     dask_client : dask.distributed.Client, optional
         A dask client for gp2Scale, asynchronous training, and certain linear algebra operations.
         On HPC architecture, this client is provided by the job script. Please have a look at the examples.
         A local client is used as the default.
-    gp2Scale_batch_size : int, optional
-        Matrix batch size for distributed computing in gp2Scale. The default is 10000.
     linalg_mode : str, optional
         Controls the linear-algebra backend used to solve (K+V)x=b and compute log|K+V|.
         The default is ``None``, which selects ``"Chol"`` for standard GPs and automatically
@@ -274,6 +281,27 @@ def __init__(
         # Check gp2Scale
         dask_client = self.initialize_gp2Scale_dask_client(gp2Scale, dask_client)
 
+        # Race-condition guard: in gp2Scale mode, only one GP can be alive per dask
+        # client.  Sharing a client between live GPs causes the new GP's init scatter
+        # to race against the previous GP's pending `_dec_ref` callbacks, surfacing as
+        # `FutureCancelledError` or `KeyError` from the scheduler.
+        if gp2Scale and dask_client is not None:
+            existing = _GP_INSTANCES_PER_CLIENT.get(dask_client.id)
+            if existing is not None and existing is not self:
+                raise Exception(
+                    f"Another GP instance is already active on this dask client "
+                    f"(client.id={dask_client.id!r}). Sharing a dask client between "
+                    f"multiple live GPs in gp2Scale mode triggers race conditions "
+                    f"in the scheduler's scatter reference counting.\n"
+                    f"To reuse the same client for a sequence of GPs, destroy the "
+                    f"previous one first:\n"
+                    f"    import gc\n"
+                    f"    del previous_gp\n"
+                    f"    gc.collect()\n"
+                    f"    client.run(lambda: None)  # flush pending releases\n"
+                    f"Or use a fresh dask client per GP."
+                )
+
         ########################################
         ###init data instance [tier 1]##########
         ########################################
@@ -359,6 +387,11 @@ def __init__(
                                      self.kv,
                                      self.likelihood)
 
+        # Register this instance for the cross-instance race-condition guard above.
+        # Entry is removed automatically when self is garbage-collected.
+        if gp2Scale and dask_client is not None:
+            _GP_INSTANCES_PER_CLIENT[dask_client.id] = self
+
     #########PROPERTIES#########################################
     @property
     def x_data(self):
@@ -499,7 +532,6 @@ def update_gp_data(
         assert isinstance(noise_variances_new, np.ndarray) or noise_variances_new is None, \
             "wrong format in noise_variances_new"
         assert len(x_new) == len(y_new), "updated x and y do not have the same lengths."
-        old_x_data = self.x_data.copy()
         if rank_n_update is None: rank_n_update = append
         if not append and rank_n_update:
             warnings.warn("`rank_n_update=True` is invalid when `append=False` "
@@ -510,10 +542,8 @@ def update_gp_data(
         self.data.update(x_new, y_new, noise_variances_new, append=append)
 
         # update prior
-        if append:
-            self.prior.augment_state_data(old_x_data, x_new)
-        else:
-            self.prior.update_state_data()
+        if append: self.prior.augment_state_data()
+        else: self.prior.update_state_data()
 
         # update likelihood
         self.likelihood.update_state()
 
@@ -37,6 +37,12 @@ def __init__(self, x_data, y_data,
         self.x_data = x_data
         self.y_data = y_data
         self.noise_variances = noise_variances
+        self.x_new = None
+        self.y_new = None
+        self.noise_variances_new = None
+        self.x_old = None
+        self.y_old = None
+        self.noise_variances_old = None
         self.point_number = len(self.x_data)
         self._check_for_nan()
         self.fvgp_x_data = None
@@ -48,12 +54,9 @@ def __init__(self, x_data, y_data,
         self.gp2Scale = gp2Scale
         self.compute_device = compute_device
         self.dask_client = dask_client
-        self.x_data_scatter_future = None
         self.compute_workers = []
         if gp2Scale and dask_client is not None:
             self.compute_workers = list(dask_client.scheduler_info()["workers"].keys())
-            self.x_data_scatter_future = dask_client.scatter(
-                self.x_data, workers=self.compute_workers, broadcast=True, direct=True)
 
     def set_fvgp_data(self, fvgp_x_data, fvgp_y_data, fvgp_noise_variances, x_out):
         self.fvgp_x_data = fvgp_x_data
@@ -91,19 +94,28 @@ def update(self, x_data_new, y_data_new, noise_variances_new=None, append=True):
             self.x_data = x_data_new
             self.y_data = y_data_new
             self.noise_variances = noise_variances_new
+            self.x_old = None
+            self.y_old = None
+            self.noise_variances_old = None
+            self.x_new = None
+            self.y_new = None
+            self.noise_variances_new = None
         else:
+            self.x_old = self.x_data
+            self.y_old = self.y_data
+            self.noise_variances_old = self.noise_variances
+            self.x_new = x_data_new
+            self.y_new = y_data_new
+            self.noise_variances_new = noise_variances_new
             if self.Euclidean: self.x_data = np.vstack([self.x_data, x_data_new])
             else: self.x_data = self.x_data + x_data_new
             self.y_data = np.vstack([self.y_data, y_data_new])
             if isinstance(noise_variances_new, np.ndarray):
                 self.noise_variances = np.append(self.noise_variances, noise_variances_new)
         self.point_number = len(self.x_data)
         self._check_for_nan()
-        if not append and self.gp2Scale and self.dask_client is not None:
-            if self.x_data_scatter_future is not None:
-                self.x_data_scatter_future.release()
-            self.x_data_scatter_future = self.dask_client.scatter(
-                self.x_data, workers=self.compute_workers, broadcast=True, direct=True)
+
+
 
     def _check_for_nan(self):
         if self.Euclidean:
@@ -113,9 +125,15 @@ def __getstate__(self):
         state = dict(
             x_data=self.x_data,
             y_data=self.y_data,
+            noise_variances=self.noise_variances,
             Euclidean=self.Euclidean,
             index_set_dim=self.index_set_dim,
-            noise_variances=self.noise_variances,
+            x_new=self.x_new,
+            y_new=self.y_new,
+            noise_variances_new=self.noise_variances_new,
+            x_old=self.x_old,
+            y_old=self.y_old,
+            noise_variances_old=self.noise_variances_old,
             point_number=self.point_number,
             fvgp_x_data=self.fvgp_x_data,
             fvgp_y_data=self.fvgp_y_data,
@@ -127,7 +145,6 @@ def __getstate__(self):
             gp2Scale=self.gp2Scale,
             compute_device=self.compute_device,
             dask_client=None,
-            x_data_scatter_future=None,
             compute_workers=self.compute_workers,
             )
         return state
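
The snapshot behaviour that `update` gains in this file (fill the `_old` and `_new` slots on append, clear them on replace) can be sketched with a stripped-down holder. `DataStore` is a hypothetical stand-in that covers only the Euclidean `x`/`y` case and omits noise variances and NaN checks.

```python
import numpy as np

class DataStore:
    def __init__(self, x, y):
        self.x_data, self.y_data = x, y
        self.x_old = self.y_old = None   # pre-append snapshot
        self.x_new = self.y_new = None   # last-appended chunk

    def update(self, x_new, y_new, append=True):
        if append:
            # Snapshot the current arrays before they are replaced by the
            # stacked ones; vstack allocates new arrays, so no copy is needed.
            self.x_old, self.y_old = self.x_data, self.y_data
            self.x_new, self.y_new = x_new, y_new
            self.x_data = np.vstack([self.x_data, x_new])
            self.y_data = np.vstack([self.y_data, y_new])
        else:
            # Full replacement: the previous data no longer exists.
            self.x_data, self.y_data = x_new, y_new
            self.x_old = self.y_old = None
            self.x_new = self.y_new = None

store = DataStore(np.zeros((3, 2)), np.zeros((3, 1)))
store.update(np.ones((2, 2)), np.ones((2, 1)), append=True)
shapes_after_append = (store.x_old.shape, store.x_new.shape, store.x_data.shape)

store.update(np.full((4, 2), 5.0), np.full((4, 1), 5.0), append=False)
```

Keeping the snapshot on the data object is what lets a downstream consumer such as `augment_state_data()` be called without arguments, as the `gp.py` change above does.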