Commit 0124239: Diagnose stalled mito training

Author: Donglai Wei
1 parent 0b264eb commit 0124239

5 files changed: 158 additions & 3 deletions

ABISS_USAGE_SUMMARY.md

Lines changed: 151 additions & 0 deletions

@@ -0,0 +1,151 @@
# ABISS Usage Summary

This note summarizes how to build and run ABISS from `/Users/weidf/Code/lib/abiss`, based on the current scripts and source in that repo.

## What ABISS Is

ABISS (Affinity Based Image Segmentation System) is a chunked 3D segmentation pipeline:

1. Watershed over affinity maps (`ws` stage).
2. Agglomeration/merge of watershed fragments (`agg` stage, typically the `me` op).
3. Optional contact-surface extraction (`cs` op).

Pipeline orchestration is driven by shell scripts in `/Users/weidf/Code/lib/abiss/scripts`.
## Build

From the ABISS repo:

```bash
cd /Users/weidf/Code/lib/abiss
mkdir -p build
cd build
cmake ..
make -j"$(nproc)"
```

On macOS (where `nproc` is not available by default), use `make -j"$(sysctl -n hw.ncpu)"` instead.

Key binaries produced in `build/`:

- `ws`, `ws2`, `ws3`
- `acme`, `meme`, `agg`, `agg_nonoverlap`, `agg_overlap`, `agg_extra`
- `split_remap`, `match_chunks`, `reduce_chunk`, `size_map`, `evaluate`
- `accs`, `mecs`, `assort`
## Runtime Layout and Entry Scripts

Primary entrypoints:

- `scripts/run_batch.sh <op> <num_composite_layers> <root_tag>`
- `scripts/remap_batch.sh <op> <num_composite_layers_unused> <root_tag>`

Where:

- `<op>` maps to script names:
  - `ws` -> `atomic_chunk_ws.sh`, `composite_chunk_ws.sh`, `remap_chunk_ws.sh`
  - `me` -> `atomic_chunk_me.sh`, `composite_chunk_me.sh`, `remap_chunk_agg.sh`
  - `cs` -> `atomic_chunk_cs.sh`, `composite_chunk_cs.sh` (no remap script in the batch wrapper)
- `<root_tag>` is the chunk tag in `mip_x_y_z` format (example: `0_0_0_0`).
## Required Environment Conventions

`scripts/init.sh` expects:

- `WORKER_HOME` (defaults to `/workspace/seg`)
- `SECRETS`: a directory containing a parameter JSON file named `param`
- `PARAM_JSON` is set to `$SECRETS/param`
- `STAGE` must be exported by the caller (`ws`, `agg`, optionally `cs`)
- For the `me` scripts, `OVERLAP` must be exported (`0`, `1`, or `2`)
  - `atomic_chunk_me.sh` and `composite_chunk_me.sh` use `set -u`, so an unset `OVERLAP` makes them fail.

`init.sh` auto-generates `$SECRETS/config.sh` by running `scripts/set_env.py $PARAM_JSON` (once), then sources it.
## Parameter JSON (`$SECRETS/param`)

### Core keys (practically required)

- `NAME`
- `BBOX`
- `CHUNK_SIZE`
- `AFF_PATH`
- `AFF_RESOLUTION`
- `WS_HIGH_THRESHOLD`
- `WS_LOW_THRESHOLD`
- `WS_SIZE_THRESHOLD`
- `AGG_THRESHOLD`
- `SCRATCH_PATH`
- `WS_PREFIX`, `SEG_PREFIX` (or explicit `WS_PATH`, `SEG_PATH`)

### Highly recommended / stage-specific

- `CHUNKMAP_OUTPUT` for watershed remap output upload
- `CHUNKMAP_INPUT` (optional; defaults to `${SCRATCH_PATH}/ws/chunkmap`)
- `WS_DUST_THRESHOLD` (defaults to `WS_SIZE_THRESHOLD`)
- `REMAP_SIZE_MAP_THRESHOLD` (defaults to `100000`)
- `SEM_PATH`, `SEMANTIC_WS`, `SEM_FILL_MISSING`
- `AFF_FILL_MISSING`, `WS_FILL_MISSING`, `SEG_FILL_MISSING`
- `GT_PATH`, `CLEFT_PATH` (used by the eval path in remap agg)
- `CHUNKED_AGG_OUTPUT`, `CHUNKED_SEG_PATH`
- `UPLOAD_CMD`, `DOWNLOAD_CMD` (auto-derived if missing)
- `REDIS_SERVER`, `REDIS_DB` (task state tracking; otherwise falls back to scratch `done/` files)
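For orientation, a minimal `param` file covering the core keys might look like the sketch below. The key names come from the lists above; the value types, paths, and numbers are illustrative guesses, not recommended settings.

```json
{
  "NAME": "example_run",
  "BBOX": [[0, 0, 0], [1024, 1024, 1024]],
  "CHUNK_SIZE": [256, 256, 256],
  "AFF_PATH": "gs://example-bucket/aff",
  "AFF_RESOLUTION": [8, 8, 40],
  "WS_HIGH_THRESHOLD": 0.99,
  "WS_LOW_THRESHOLD": 0.01,
  "WS_SIZE_THRESHOLD": 200,
  "AGG_THRESHOLD": 0.2,
  "SCRATCH_PATH": "gs://example-bucket/scratch",
  "WS_PREFIX": "gs://example-bucket/ws",
  "SEG_PREFIX": "gs://example-bucket/seg"
}
```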
### Affinity channel expectations

In `cut_chunk_common.py`:

- 1 channel: interpreted as a probability map and converted to 3 affinities.
- 3 channels: used as affinities.
- 4 channels: first 3 affinity + 1 myelin.
- N channels: first `AFF_CHANNELS` channels (default `3`).
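The channel rules above can be sketched as a small helper. This is a hypothetical reimplementation for illustration, not the actual code in `cut_chunk_common.py`; in particular, the single-channel conversion is shown here as simple replication, which may differ from the real conversion.

```python
import numpy as np

def select_affinity_channels(vol: np.ndarray, aff_channels: int = 3) -> np.ndarray:
    """Sketch of the channel-handling rules; `vol` has shape (C, z, y, x)."""
    c = vol.shape[0]
    if c == 1:
        # Probability map -> 3 affinity channels (shown as replication here).
        return np.repeat(vol, 3, axis=0)
    if c == 3:
        return vol               # already three affinity channels
    if c == 4:
        return vol[:3]           # drop the trailing myelin channel
    return vol[:aff_channels]    # generic: first AFF_CHANNELS channels
```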
## Minimal End-to-End Run Sequence

Assuming:

- ABISS repo at `/Users/weidf/Code/lib/abiss`
- your param JSON available at `$SECRETS/param`
- root chunk tag `0_0_0_0`
- composite layers `3` (example)

```bash
export WORKER_HOME=/Users/weidf/Code/lib/abiss
export SECRETS=/path/to/secrets_dir
export OVERLAP=0

# 1) Watershed atomic+composite
export STAGE=ws
/Users/weidf/Code/lib/abiss/scripts/run_batch.sh ws 3 0_0_0_0

# 2) Watershed remap (writes to WS_PATH and CHUNKMAP_OUTPUT)
export STAGE=ws
/Users/weidf/Code/lib/abiss/scripts/remap_batch.sh ws 3 0_0_0_0

# 3) Agglomeration atomic+composite (ME path)
export STAGE=agg
/Users/weidf/Code/lib/abiss/scripts/run_batch.sh me 3 0_0_0_0

# 4) Agg remap (writes to SEG_PATH and size map)
export STAGE=agg
/Users/weidf/Code/lib/abiss/scripts/remap_batch.sh agg 3 0_0_0_0
```
## Outputs (High Level)

- `ws` remap uploads the chunked watershed segmentation to `WS_PATH`.
- `ws` remap uploads chunkmaps to `CHUNKMAP_OUTPUT`.
- `agg` remap uploads the final segmentation to `SEG_PATH`.
- `agg` remap uploads the size map to `${SEG_PATH}/size_map`.

## Common Failure Points

- `STAGE` not set: scripts depend on it for pathing and remap logic.
- `OVERLAP` not set for `me`: the shell exits due to `set -u`.
- `CHUNKMAP_OUTPUT` missing: the watershed remap upload path becomes invalid.
- Wrong `AFF_RESOLUTION` / bbox alignment: cutout and upload mismatches.
- Missing cloud credentials or an invalid `UPLOAD_CMD` / `DOWNLOAD_CMD`.
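Several of these failures can be caught before launching a batch. Below is a hypothetical preflight helper (not part of the ABISS repo) that checks the environment conventions listed earlier; adapt the variable list to your setup.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check: verify the environment that init.sh and the
# per-op scripts rely on before calling run_batch.sh / remap_batch.sh.
preflight() {
  local op="${1:-}" missing=0 v
  for v in WORKER_HOME SECRETS STAGE; do
    if [ -z "${!v:-}" ]; then
      echo "missing: $v"
      missing=1
    fi
  done
  # The me scripts run under `set -u`, so OVERLAP must be exported for op=me.
  if [ "$op" = "me" ] && [ -z "${OVERLAP:-}" ]; then
    echo "missing: OVERLAP (required for the me op)"
    missing=1
  fi
  # init.sh reads the parameter JSON from $SECRETS/param.
  if [ -n "${SECRETS:-}" ] && [ ! -f "$SECRETS/param" ]; then
    echo "missing: $SECRETS/param"
    missing=1
  fi
  return "$missing"
}
```

Run `preflight me` (or `preflight ws`) right before the corresponding batch script; a non-zero exit status means at least one prerequisite is absent.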
## Optional Paths

- Contact surface pipeline:
  - `export STAGE=cs`
  - `scripts/run_batch.sh cs <layers> <root_tag>`
- Legacy `rlme` scripts exist, but they reference binaries (`ac`, `me`) not defined in the current `CMakeLists.txt`; treat them as legacy unless you add/build those tools.

connectomics/config/hydra_config.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -574,6 +574,7 @@ class OptimizationConfig:
     # Validation and logging
     val_check_interval: Union[int, float] = 1.0
     log_every_n_steps: int = 50
+    num_sanity_val_steps: int = 2

     optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
     scheduler: SchedulerConfig = field(default_factory=SchedulerConfig)
```

connectomics/training/lit/trainer.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -275,6 +275,7 @@ def create_trainer(
     trainer = pl.Trainer(
         max_epochs=max_epochs,
         max_steps=max_steps,
+        num_sanity_val_steps=cfg.optimization.num_sanity_val_steps,
         accelerator="gpu" if use_gpu else "cpu",
         devices=system_cfg.num_gpus if use_gpu else 1,
         strategy=strategy,
```

connectomics/utils/debug_utils.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -5,12 +5,14 @@
 the entire training pipeline without modifying training logic.
 """

+import os
 import torch
 import numpy as np
 from typing import Union, Optional

-# Global debug flag - set to True to enable debug prints
-DEBUG_NORM = True
+# Global debug flag. Enabled only when explicitly requested.
+# Set `PYTC_DEBUG_NORM=1` to turn on normalization debug prints.
+DEBUG_NORM = os.environ.get("PYTC_DEBUG_NORM", "0").lower() in {"1", "true", "yes", "on"}

 # Track which stages have been printed (to avoid spam)
 _printed_stages = set()
@@ -163,4 +165,3 @@ def print_normalization_check(
     "print_normalization_check",
     "reset_debug_state",
 ]
-
```

tutorials/mito_mitoEM_H.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -34,6 +34,7 @@ data:
     - EM30-H/mito_val-v2.h5
 optimization:
   accumulate_grad_batches: 4
+  num_sanity_val_steps: 0
 monitor:
 logging:
   scalar:
```
