Commit 0124239: Diagnose stalled mito training

Author: Donglai Wei
1 parent 0b264eb commit 0124239

5 files changed: 158 additions & 3 deletions

ABISS_USAGE_SUMMARY.md

Lines changed: 151 additions & 0 deletions

@@ -0,0 +1,151 @@
# ABISS Usage Summary

This note summarizes how to build and run ABISS from `/Users/weidf/Code/lib/abiss`, based on the current scripts and source in that repo.

## What ABISS Is

ABISS (Affinity Based Image Segmentation System) is a chunked 3D segmentation pipeline:

1. Watershed over affinity maps (`ws` stage).
2. Agglomeration/merge of watershed fragments (`agg` stage, typically the `me` op).
3. Optional contact-surface extraction (`cs` op).

Pipeline orchestration is driven by shell scripts in `/Users/weidf/Code/lib/abiss/scripts`.
## Build

From the ABISS repo:

```bash
cd /Users/weidf/Code/lib/abiss
mkdir -p build
cd build
cmake ..
make -j"$(nproc)"
```

On macOS (where `nproc` is not available by default), use `make -j"$(sysctl -n hw.ncpu)"` instead.

Key binaries produced in `build/`:

- `ws`, `ws2`, `ws3`
- `acme`, `meme`, `agg`, `agg_nonoverlap`, `agg_overlap`, `agg_extra`
- `split_remap`, `match_chunks`, `reduce_chunk`, `size_map`, `evaluate`
- `accs`, `mecs`, `assort`
## Runtime Layout and Entry Scripts

Primary entrypoints:

- `scripts/run_batch.sh <op> <num_composite_layers> <root_tag>`
- `scripts/remap_batch.sh <op> <num_composite_layers_unused> <root_tag>`

Where:

- `<op>` maps to script names:
  - `ws` -> `atomic_chunk_ws.sh`, `composite_chunk_ws.sh`, `remap_chunk_ws.sh`
  - `me` -> `atomic_chunk_me.sh`, `composite_chunk_me.sh`, `remap_chunk_agg.sh`
  - `cs` -> `atomic_chunk_cs.sh`, `composite_chunk_cs.sh` (no remap script in the batch wrapper)
- `<root_tag>` is the chunk tag in `mip_x_y_z` format (example: `0_0_0_0`).
## Required Environment Conventions

`scripts/init.sh` expects:

- `WORKER_HOME` (defaults to `/workspace/seg`)
- `SECRETS`: a directory containing a parameter JSON file named `param`
- `PARAM_JSON` is set to `$SECRETS/param`
- `STAGE` must be exported by the caller (`ws`, `agg`, optionally `cs`)
- For the `me` scripts, `OVERLAP` must be exported (`0`, `1`, or `2`)
  - `atomic_chunk_me.sh` and `composite_chunk_me.sh` use `set -u`, so an unset `OVERLAP` makes them fail.

`init.sh` auto-generates `$SECRETS/config.sh` by running `scripts/set_env.py $PARAM_JSON` (once), then sources it.
## Parameter JSON (`$SECRETS/param`)

### Core keys (practically required)

- `NAME`
- `BBOX`
- `CHUNK_SIZE`
- `AFF_PATH`
- `AFF_RESOLUTION`
- `WS_HIGH_THRESHOLD`
- `WS_LOW_THRESHOLD`
- `WS_SIZE_THRESHOLD`
- `AGG_THRESHOLD`
- `SCRATCH_PATH`
- `WS_PREFIX`, `SEG_PREFIX` (or explicit `WS_PATH`, `SEG_PATH`)

### Highly recommended / stage-specific

- `CHUNKMAP_OUTPUT` for watershed remap output upload
- `CHUNKMAP_INPUT` (optional; defaults to `${SCRATCH_PATH}/ws/chunkmap`)
- `WS_DUST_THRESHOLD` (defaults to `WS_SIZE_THRESHOLD`)
- `REMAP_SIZE_MAP_THRESHOLD` (defaults to `100000`)
- `SEM_PATH`, `SEMANTIC_WS`, `SEM_FILL_MISSING`
- `AFF_FILL_MISSING`, `WS_FILL_MISSING`, `SEG_FILL_MISSING`
- `GT_PATH`, `CLEFT_PATH` (used by the eval path in remap agg)
- `CHUNKED_AGG_OUTPUT`, `CHUNKED_SEG_PATH`
- `UPLOAD_CMD`, `DOWNLOAD_CMD` (auto-derived if missing)
- `REDIS_SERVER`, `REDIS_DB` (task state tracking; otherwise falls back to scratch `done/` files)
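For orientation, a minimal `param` file covering the core keys might look like the sketch below. The key names come from the lists above; the value types, paths, and numbers are illustrative guesses, not recommended settings.

```json
{
  "NAME": "example_run",
  "BBOX": [[0, 0, 0], [1024, 1024, 1024]],
  "CHUNK_SIZE": [256, 256, 256],
  "AFF_PATH": "gs://example-bucket/aff",
  "AFF_RESOLUTION": [8, 8, 40],
  "WS_HIGH_THRESHOLD": 0.99,
  "WS_LOW_THRESHOLD": 0.01,
  "WS_SIZE_THRESHOLD": 200,
  "AGG_THRESHOLD": 0.2,
  "SCRATCH_PATH": "gs://example-bucket/scratch",
  "WS_PREFIX": "gs://example-bucket/ws",
  "SEG_PREFIX": "gs://example-bucket/seg"
}
```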
### Affinity channel expectations

In `cut_chunk_common.py`:

- 1 channel: interpreted as a probability map and converted to 3 affinities.
- 3 channels: used as affinities.
- 4 channels: first 3 affinity + 1 myelin.
- N channels: first `AFF_CHANNELS` channels (default `3`).
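The channel rules above can be sketched as a small helper. This is a hypothetical reimplementation for illustration, not the actual code in `cut_chunk_common.py`; in particular, the single-channel conversion is shown here as simple replication, which may differ from the real conversion.

```python
import numpy as np

def select_affinity_channels(vol: np.ndarray, aff_channels: int = 3) -> np.ndarray:
    """Sketch of the channel-handling rules; `vol` has shape (C, z, y, x)."""
    c = vol.shape[0]
    if c == 1:
        # Probability map -> 3 affinity channels (shown as replication here).
        return np.repeat(vol, 3, axis=0)
    if c == 3:
        return vol               # already three affinity channels
    if c == 4:
        return vol[:3]           # drop the trailing myelin channel
    return vol[:aff_channels]    # generic: first AFF_CHANNELS channels
```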
## Minimal End-to-End Run Sequence

Assuming:

- ABISS repo at `/Users/weidf/Code/lib/abiss`
- your param JSON available at `$SECRETS/param`
- root chunk tag `0_0_0_0`
- composite layers `3` (example)

```bash
export WORKER_HOME=/Users/weidf/Code/lib/abiss
export SECRETS=/path/to/secrets_dir
export OVERLAP=0

# 1) Watershed atomic+composite
export STAGE=ws
/Users/weidf/Code/lib/abiss/scripts/run_batch.sh ws 3 0_0_0_0

# 2) Watershed remap (writes to WS_PATH and CHUNKMAP_OUTPUT)
export STAGE=ws
/Users/weidf/Code/lib/abiss/scripts/remap_batch.sh ws 3 0_0_0_0

# 3) Agglomeration atomic+composite (ME path)
export STAGE=agg
/Users/weidf/Code/lib/abiss/scripts/run_batch.sh me 3 0_0_0_0

# 4) Agg remap (writes to SEG_PATH and size map)
export STAGE=agg
/Users/weidf/Code/lib/abiss/scripts/remap_batch.sh agg 3 0_0_0_0
```
## Outputs (High Level)

- `ws` remap uploads the chunked watershed segmentation to `WS_PATH`.
- `ws` remap uploads chunkmaps to `CHUNKMAP_OUTPUT`.
- `agg` remap uploads the final segmentation to `SEG_PATH`.
- `agg` remap uploads the size map to `${SEG_PATH}/size_map`.

## Common Failure Points

- `STAGE` not set: scripts depend on it for pathing and remap logic.
- `OVERLAP` not set for `me`: the shell exits due to `set -u`.
- `CHUNKMAP_OUTPUT` missing: the watershed remap upload path becomes invalid.
- Wrong `AFF_RESOLUTION` / bbox alignment: cutout and upload mismatches.
- Missing cloud credentials or an invalid `UPLOAD_CMD` / `DOWNLOAD_CMD`.
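Several of these failures can be caught before launching a batch. Below is a hypothetical preflight helper (not part of the ABISS repo) that checks the environment conventions listed earlier; adapt the variable list to your setup.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check: verify the environment that init.sh and the
# per-op scripts rely on before calling run_batch.sh / remap_batch.sh.
preflight() {
  local op="${1:-}" missing=0 v
  for v in WORKER_HOME SECRETS STAGE; do
    if [ -z "${!v:-}" ]; then
      echo "missing: $v"
      missing=1
    fi
  done
  # The me scripts run under `set -u`, so OVERLAP must be exported for op=me.
  if [ "$op" = "me" ] && [ -z "${OVERLAP:-}" ]; then
    echo "missing: OVERLAP (required for the me op)"
    missing=1
  fi
  # init.sh reads the parameter JSON from $SECRETS/param.
  if [ -n "${SECRETS:-}" ] && [ ! -f "$SECRETS/param" ]; then
    echo "missing: $SECRETS/param"
    missing=1
  fi
  return "$missing"
}
```

Run `preflight me` (or `preflight ws`) right before the corresponding batch script; a non-zero exit status means at least one prerequisite is absent.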
## Optional Paths

- Contact surface pipeline:
  - `export STAGE=cs`
  - `scripts/run_batch.sh cs <layers> <root_tag>`
- Legacy `rlme` scripts exist, but they reference binaries (`ac`, `me`) not defined in the current `CMakeLists.txt`; treat them as legacy unless you add/build those tools.

connectomics/config/hydra_config.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -574,6 +574,7 @@ class OptimizationConfig:
     # Validation and logging
     val_check_interval: Union[int, float] = 1.0
     log_every_n_steps: int = 50
+    num_sanity_val_steps: int = 2

     optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
     scheduler: SchedulerConfig = field(default_factory=SchedulerConfig)
```

connectomics/training/lit/trainer.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -275,6 +275,7 @@ def create_trainer(
     trainer = pl.Trainer(
         max_epochs=max_epochs,
         max_steps=max_steps,
+        num_sanity_val_steps=cfg.optimization.num_sanity_val_steps,
         accelerator="gpu" if use_gpu else "cpu",
         devices=system_cfg.num_gpus if use_gpu else 1,
         strategy=strategy,
```

connectomics/utils/debug_utils.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -5,12 +5,14 @@
 the entire training pipeline without modifying training logic.
 """

+import os
 import torch
 import numpy as np
 from typing import Union, Optional

-# Global debug flag - set to True to enable debug prints
-DEBUG_NORM = True
+# Global debug flag. Enabled only when explicitly requested.
+# Set `PYTC_DEBUG_NORM=1` to turn on normalization debug prints.
+DEBUG_NORM = os.environ.get("PYTC_DEBUG_NORM", "0").lower() in {"1", "true", "yes", "on"}

 # Track which stages have been printed (to avoid spam)
 _printed_stages = set()
@@ -163,4 +165,3 @@ def print_normalization_check(
     "print_normalization_check",
     "reset_debug_state",
 ]
-
```

tutorials/mito_mitoEM_H.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -34,6 +34,7 @@ data:
     - EM30-H/mito_val-v2.h5
 optimization:
   accumulate_grad_batches: 4
+  num_sanity_val_steps: 0
 monitor:
 logging:
   scalar:
```
