
Commit b7ea160

[DATALAD RUNCMD] yolo /speckit.implement
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "yolo /speckit.implement",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
2 parents d9a88f8 + 94874bc commit b7ea160

39 files changed

Lines changed: 2167 additions & 204 deletions

.specify/specs/00-initial-design.md

Lines changed: 25 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -203,13 +203,13 @@ A dataset needs to be split — for example, extracting only behavioral data or
203203
- What happens when a rename creates a filename that exceeds OS path length limits?
204204
**Resolution**: Refuse with exit code 2 and a clear error. Covered by FR-011 (refuse invalid state). No extra task needed — implement as a guard in `rename_file()`.
205205
- How does the tool handle symlinked files (common with git-annex)?
206-
**Resolution**: `_vcs.py` GitAnnex backend handles this. Locked annexed files are symlinks; `git annex` commands operate on them correctly. Covered by T017-T018.
206+
**Resolution**: All file iteration code MUST treat symlinks as files (FR-023). `Path.is_file()` follows symlinks and returns `False` for annexed files without content — use `not path.is_dir()` instead. VCS operations (`git mv`, `git annex unlock/add`) handle symlinks correctly. Covered by T092.
207207
- What happens when `_scans.tsv` references files that don't exist on disk (dangling references)?
208208
**Resolution**: Warn but do not fail. Dangling references are a pre-existing dataset issue, not caused by bids-utils. Log at `-v` verbosity.
209209
- How does the tool handle partial datasets (e.g., missing `dataset_description.json`)?
210210
**Resolution**: `BIDSDataset.from_path()` raises an error if no `dataset_description.json` is found. Covered by T013-T014.
211211
- What happens when a file is locked by git-annex and content is needed for metadata operations?
212-
**Resolution**: For metadata read operations, content is needed. If locked and content unavailable, skip that file with a warning. Covered by T017-T018 (VCS tests should include locked-file scenario).
212+
**Resolution**: All file reads go through a content-aware I/O layer. The behavior is controlled by the `--annexed` policy option (FR-022): `error` (default, informative message), `get` (auto-fetch), `skip-warning`, or `skip`. The VCS backend provides `has_content()` and `get_content()` methods. Covered by T086-T091.
213213
- How does aggregation handle `.nwb` files that embed metadata internally?
214214
**Resolution**: Out of scope. bids-utils operates on BIDS sidecar metadata (`.json` files), not on embedded metadata within data files. NWB internal metadata is outside BIDS's inheritance model.
215215
- What happens when operating on a dataset on a read-only filesystem?
@@ -232,13 +232,29 @@ A dataset needs to be split — for example, extracting only behavioral data or
232232
- Q: Should `bids-utils completion` offer `--install` to modify shell rc files? → A: No; print activation script to stdout only (user handles installation).
233233
- Q: Which argument types get custom completions initially? → A: Filesystem-derived items (`sub-*`, `ses-*` directories, BIDS file paths) plus entity keys from the schema (`task=`, `run=`, `acq=`, etc.). Entity value discovery deferred.
234234

235+
### Session 2026-04-09
236+
237+
- Q: Where should the `--annexed` option live — per-command or group-level? → A: **Group-level** (`bids-utils --annexed=MODE COMMAND ...`). Every command that reads files is affected (rename reads sidecars, migrate reads JSON, session-rename reads `_scans.tsv`, metadata reads JSON). It's a dataset-level concern, not command-specific. Putting it on the group avoids repeating the option across ~10 commands. The policy flows through `BIDSDataset.annexed_mode` so library users get the same behavior.
238+
- Q: What modes should `--annexed` support? → A: `error` (default), `get`, `skip-warning`, `skip`. Environment variable `BIDS_UTILS_ANNEXED` for persistent preference.
239+
- Q: Should `dataset_description.json` reads be guarded by the annex policy? → A: No. This file is essentially never annexed (small JSON tracked in git). Adding annex awareness to `BIDSDataset.from_path()` creates a chicken-and-egg problem since the dataset object doesn't exist yet.
240+
- Q: Should content fetching be batched? → A: Initial implementation does per-file checks/fetches. Batch optimization (`ensure_content_batch`) can be added later for scan-heavy operations (migrate, metadata audit).
241+
- Q: What about writing to annexed files? → A: Annexed files in locked mode (symlinks to `.git/annex/objects`) are read-only. Before modification, `unlock(paths)` must be called (`git annex unlock` / `datalad unlock`). After modification, `add(paths)` must be called (`git annex add`) to re-annex the file. The I/O layer provides `ensure_writable()` (unlock) and `mark_modified()` (add) to bracket writes. The full lifecycle for a modify operation on an annexed file is: get → unlock → read → modify → write → add.
242+
- Q: Should `unlock`/`add` be implicit or require `--annexed=get`? → A: `unlock` and `add` apply whenever the VCS is git-annex/DataLad, regardless of `--annexed` mode. The `--annexed` mode only controls what happens when content is *missing*. If content is present but the file is locked, any write operation must unlock first — this is a VCS-level concern, not a policy choice.
243+
244+
### Session 2026-04-10
245+
246+
- Q: Should `--dry-run` show every file operation or just a summary? → A: Both. `--dry-run` (no value or `--dry-run=overview`) shows the current summary view (one line per subject/session). `--dry-run=detailed` lists every individual file rename, file edit, and `_scans.tsv` update. The detailed mode is what users need to verify correctness before committing. The overview mode remains the default for quick checks.
247+
- Q: How should annexed content operations be logged? → A: When `--annexed=get` fetches content, log each file fetched at normal verbosity. In `--dry-run` mode, report which files *would* need content fetched. At `-v`, also log `unlock` and `add` operations.
248+
- **BUG**: `session.py` and `subject.py` use `Path.is_file()` to filter files for renaming, but `is_file()` follows symlinks — returning `False` for annexed files without local content (broken symlinks into `.git/annex/objects`). This means **annexed data files (`.nii.gz`, etc.) are silently skipped during rename**. The fix: use `not path.is_dir()` or `path.is_file() or path.is_symlink()` everywhere that iterates over files for processing. This affects `session.py`, `subject.py`, `run.py`, `split.py`, `merge.py`, `_sidecars.py`, and `migrate.py`. All existing tests missed this because they use `tmp_path` fixtures with real files, never symlinks.
249+
- Q: Why didn't the `bids-examples` integration tests catch the symlink bug? → A: `bids-examples` datasets contain regular files, not annexed symlinks. Integration tests need a fixture that creates a git-annex repo with locked (symlinked) files to exercise this path. Add a `tmp_annex_dataset` fixture.
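
The symlink pitfall recorded in the BUG entry above can be reproduced without git-annex. The following is a minimal sketch, not the project's code: `is_file_like`, `iter_files`, and `demo` are hypothetical names, and a dangling symlink stands in for a locked annexed file whose content is not present locally.

```python
import os
import tempfile
from pathlib import Path


def is_file_like(path: Path) -> bool:
    """Treat regular files and symlinks (even dangling ones) as files."""
    return path.is_file() or path.is_symlink()


def iter_files(directory: Path) -> list:
    """Return the file-like entries directly under `directory`, sorted."""
    return sorted(p for p in directory.iterdir() if is_file_like(p))


def demo() -> dict:
    """Show how a dangling symlink is seen by each predicate."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        link = root / "sub-01_T1w.nii.gz"
        # Dangling symlink: the target is never created, mimicking an
        # annexed file whose content has not been fetched.
        os.symlink(root / "missing-annex-object", link)
        return {
            "is_file": link.is_file(),        # follows the link, so False
            "is_symlink": link.is_symlink(),  # inspects the link itself
            "is_file_like": is_file_like(link),
            "listed": link in iter_files(root),
        }
```

The two spellings allowed by FR-023 are not fully equivalent: for a symlink that points to a directory, `not path.is_dir()` is `False` while `path.is_file() or path.is_symlink()` is `True`.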
250+
235251
## Requirements *(mandatory)*
236252

237253
### Functional Requirements
238254

239255
- **FR-001**: System MUST provide a Python library (`bids_utils`) with a clean, importable public API. Every CLI command maps to a library function.
240256
- **FR-002**: System MUST provide a CLI (`bids-utils`) as a thin wrapper over the library API.
241-
- **FR-003**: Every mutating command MUST support `--dry-run` / `-n` mode showing exactly what would change without modifying any files.
257+
- **FR-003**: Every mutating command MUST support `--dry-run` / `-n` mode showing exactly what would change without modifying any files. `--dry-run` (or `--dry-run=overview`) shows a summary view; `--dry-run=detailed` lists every individual file operation (rename, edit, content fetch). SC-002 applies to the detailed mode.
242258
- **FR-004**: System MUST detect and use VCS (git, git-annex, DataLad) when present — `git mv` instead of `os.rename`, etc. When no VCS is detected, operate directly on filesystem.
243259
- **FR-005**: System MUST update `_scans.tsv` entries whenever referenced files are renamed or removed.
244260
- **FR-006**: System MUST update `participants.tsv` when subjects are renamed or removed.
@@ -257,6 +273,9 @@ A dataset needs to be split — for example, extracting only behavioral data or
257273
- **FR-019**: System MUST provide a `bids-utils completion [SHELL]` subcommand that outputs shell completion activation scripts. When `SHELL` argument is omitted, auto-detect from the `$SHELL` environment variable. Supported shells: Bash, Zsh, Fish (matching Click 8.0+ built-in completion support). Output goes to stdout only (no `--install` flag).
258274
- **FR-020**: CLI MUST resolve the BIDS dataset root by: (1) using the `--dataset`/`-d` flag if provided, or (2) walking up the directory hierarchy from CWD until `dataset_description.json` is found. This resolution is used both by commands and by shell completion.
259275
- **FR-021**: Shell completion MUST provide BIDS-aware completions: filesystem-derived items (`sub-*` directories, `ses-*` directories, BIDS file paths) and entity keys from the `bidsschematools` schema (e.g., `task=`, `run=`, `acq=`). Entity value completion (e.g., `task=rest`) is deferred to a later release.
276+
- **FR-023**: All code that iterates over files MUST treat symlinks as files (not skip them). Use `not path.is_dir()` or `path.is_file() or path.is_symlink()` instead of bare `path.is_file()`. This is critical for git-annex datasets where data files are symlinks to `.git/annex/objects`.
277+
- **FR-024**: Annexed content operations (get, unlock, add) MUST be logged. At normal verbosity, log each file fetched by `--annexed=get`. In `--dry-run` mode, report files that would need content fetched. At `-v`, also log unlock/add operations. This gives users visibility into what the annex layer is doing.
278+
- **FR-022**: System MUST provide a group-level `--annexed` option controlling behavior when git-annex/DataLad file content is not locally available. Modes: `error` (default — informative error listing missing files and suggesting `--annexed=get` or `git annex get`), `get` (automatically fetch content via `git annex get` / `datalad get` before reading), `skip-warning` (skip files without content with a per-file warning), `skip` (skip silently). The option MUST also be settable via `BIDS_UTILS_ANNEXED` environment variable (CLI flag takes precedence). The VCS backend protocol MUST expose: `has_content(path)` and `get_content(paths)` for reads; `unlock(paths)` to make locked annexed files writable before modification; `add(paths)` to re-annex modified files after writes (restoring them to their original tracked state). All file reads (TSV, JSON sidecars) MUST go through a content-aware I/O layer. All file writes to potentially-annexed files MUST go through an unlock-before/add-after lifecycle managed by the I/O layer.
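
The precedence rule in FR-022 (CLI flag over `BIDS_UTILS_ANNEXED`, falling back to `error`) could be resolved along these lines. This is a sketch under assumptions: `resolve_annexed_mode` is a hypothetical helper, and the enum member names other than `ERROR` are guesses derived from the mode strings.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class AnnexedMode(str, Enum):
    ERROR = "error"
    GET = "get"                    # member name assumed
    SKIP_WARNING = "skip-warning"  # member name assumed
    SKIP = "skip"                  # member name assumed


def resolve_annexed_mode(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> AnnexedMode:
    """Pick the mode: CLI flag first, then the environment, then `error`."""
    raw = cli_value if cli_value is not None else env.get("BIDS_UTILS_ANNEXED")
    if raw is None:
        return AnnexedMode.ERROR
    # Enum lookup by value raises ValueError for an unknown mode string.
    return AnnexedMode(raw)
```

For example, `resolve_annexed_mode("get", {"BIDS_UTILS_ANNEXED": "skip"})` yields `AnnexedMode.GET`, because the flag wins over the environment variable.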
260279

261280
### Key Entities
262281

@@ -272,7 +291,8 @@ A dataset needs to be split — for example, extracting only behavioral data or
272291
### Measurable Outcomes
273292

274293
- **SC-001**: Every bids-examples dataset that is valid before a `rename`/`subject-rename`/`session-rename` operation is still valid after the operation completes.
275-
- **SC-002**: `--dry-run` output for every command matches the actual changes when run without `--dry-run` (verified by comparing dry-run output to actual filesystem diff).
294+
- **SC-002**: `--dry-run=detailed` output for every command matches the actual changes when run without `--dry-run` (verified by comparing dry-run output to actual filesystem diff). `--dry-run=overview` provides a human-friendly summary.
295+
- **SC-008**: All file-renaming operations (session-rename, subject-rename, rename) correctly handle git-annex symlinks — verified by tests using a `tmp_annex_dataset` fixture with locked annexed files.
276296
- **SC-003**: All commands complete on a 1000-subject dataset in O(n) time relative to affected files (not O(n²) in total dataset size). Single-entity operations (rename, remove-run) must not scan the entire dataset. Benchmark target: `rename` on a single file in a 1000-subject dataset completes in under 5 seconds.
277297
- **SC-004**: Library API is independently usable: all acceptance scenarios can be executed via Python imports without the CLI.
278298
- **SC-005**: 100% of mutating commands have both `--dry-run` and `--json` modes tested in CI.
@@ -284,7 +304,7 @@ A dataset needs to be split — for example, extracting only behavioral data or
284304
- Users have Python 3.10+ installed (aligned with current ecosystem support).
285305
- `bidsschematools` provides stable, versioned access to the BIDS schema. If its API changes, bids-utils will adapt.
286306
- The BIDS validator (`bids-validator-deno`) is available for integration testing but is not a runtime dependency.
287-
- Datasets fit on local disk for direct operations. Remote/annexed access is a separate concern handled via fsspec/git-annex passthrough.
307+
- Datasets fit on local disk for direct operations. Annexed files without local content are handled via `--annexed` policy (FR-022): error by default, with auto-fetch and skip modes.
288308
- The initial release focuses on local filesystem operations. Full DataLad integration (provenance via `datalad run`) is a subsequent enhancement.
289309
- `bids-examples` git repository is available as a submodule or fixture for testing.
290310
- The project uses `uv` for package management, `tox` + `tox-uv` for test orchestration, `ruff` for linting, `mypy` for type checking, `mkdocs` for documentation — as stated in the constitution.

.specify/specs/00-initial-design/contracts/library-api.md

Lines changed: 52 additions & 2 deletions
@@ -10,6 +10,7 @@
1010
class BIDSDataset:
1111
root: Path
1212
bids_version: str
13+
annexed_mode: AnnexedMode = AnnexedMode.ERROR
1314

1415
@classmethod
1516
def from_path(cls, path: str | Path) -> BIDSDataset:
@@ -160,10 +161,59 @@ def merge_datasets(
160161
"""Merge multiple BIDS datasets."""
161162
```
162163

164+
### `bids_utils._vcs.VCSBackend` (Protocol)
165+
166+
```python
167+
class VCSBackend(Protocol):
168+
name: str
169+
170+
# Existing operations
171+
def move(self, src: Path, dst: Path) -> None: ...
172+
def remove(self, path: Path) -> None: ...
173+
def is_dirty(self) -> bool: ...
174+
def commit(self, message: str, paths: list[Path]) -> None: ...
175+
176+
# Content availability (FR-022)
177+
def has_content(self, path: Path) -> bool: ...
178+
def get_content(self, paths: list[Path]) -> None: ...
179+
180+
# Write lifecycle for annexed files (FR-022)
181+
def unlock(self, paths: list[Path]) -> None: ...
182+
def add(self, paths: list[Path]) -> None: ...
183+
```
184+
185+
| Backend | `has_content` | `get_content` | `unlock` | `add` |
186+
|-----------|-----------------------|---------------------|-----------------------|---------------------|
187+
| NoVCS | always `True` | no-op | no-op | no-op |
188+
| Git | always `True` | no-op | no-op | `git add` |
189+
| GitAnnex | symlink target exists | `git annex get` | `git annex unlock` | `git annex add` |
190+
| DataLad | symlink target exists | `datalad get` | `datalad unlock` | `git annex add` |
191+
192+
### `bids_utils._io` (Content-aware I/O)
193+
194+
```python
195+
def ensure_content(path: Path, vcs: VCSBackend, mode: AnnexedMode) -> None:
196+
"""Ensure file content is available for reading. Enforces --annexed policy."""
197+
198+
def ensure_writable(path: Path, vcs: VCSBackend) -> None:
199+
"""Unlock annexed file if locked (symlink to .git/annex/objects).
200+
Always applied for GitAnnex/DataLad, regardless of --annexed mode."""
201+
202+
def mark_modified(paths: list[Path], vcs: VCSBackend) -> None:
203+
"""Re-annex files after modification (git annex add).
204+
Always applied for GitAnnex/DataLad, regardless of --annexed mode."""
205+
206+
def read_json(path: Path, vcs: VCSBackend, mode: AnnexedMode) -> dict | None:
207+
"""Read JSON with content-awareness. Returns None if skipped."""
208+
```
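
To make the policy and the get → unlock → read → modify → write → add lifecycle concrete, here is a rough sketch against an in-memory stand-in for the backend. Several things are assumptions rather than part of the contract: `FakeAnnex`, `update_json`, and `demo` are hypothetical; `ensure_content` returns a `bool` here to signal a skip, whereas the contract declares `-> None`; and the error type is a placeholder.

```python
import json
import logging
import tempfile
from enum import Enum
from pathlib import Path
from typing import Optional

log = logging.getLogger("bids_utils_sketch")


class AnnexedMode(str, Enum):
    ERROR = "error"
    GET = "get"  # member names other than ERROR are assumed
    SKIP_WARNING = "skip-warning"
    SKIP = "skip"


class FakeAnnex:
    """Hypothetical backend: records calls instead of running git-annex."""

    def __init__(self, missing: set) -> None:
        self.missing = set(missing)
        self.calls: list = []

    def has_content(self, path: Path) -> bool:
        return path not in self.missing

    def get_content(self, paths: list) -> None:
        self.calls.append("get")
        self.missing -= set(paths)

    def unlock(self, paths: list) -> None:
        self.calls.append("unlock")

    def add(self, paths: list) -> None:
        self.calls.append("add")


def ensure_content(path: Path, vcs: FakeAnnex, mode: AnnexedMode) -> bool:
    """Apply the --annexed policy; True means readable, False means skip."""
    if vcs.has_content(path):
        return True
    if mode is AnnexedMode.GET:
        vcs.get_content([path])
        return True
    if mode is AnnexedMode.ERROR:
        raise FileNotFoundError(f"{path}: no local content; try --annexed=get")
    if mode is AnnexedMode.SKIP_WARNING:
        log.warning("skipping %s: content not available", path)
    return False


def read_json(path: Path, vcs: FakeAnnex, mode: AnnexedMode) -> Optional[dict]:
    """Read JSON through the policy check. Returns None if skipped."""
    if not ensure_content(path, vcs, mode):
        return None
    return json.loads(path.read_text())


def update_json(path: Path, vcs: FakeAnnex, mode: AnnexedMode, **changes) -> bool:
    """Bracket a sidecar edit with unlock (before) and add (after)."""
    if not ensure_content(path, vcs, mode):  # get, if the policy allows it
        return False
    vcs.unlock([path])                       # role of ensure_writable()
    data = json.loads(path.read_text())
    data.update(changes)
    path.write_text(json.dumps(data, indent=2) + "\n")
    vcs.add([path])                          # role of mark_modified()
    return True


def demo() -> list:
    """Edit a sidecar whose content is initially 'missing'; return the calls."""
    with tempfile.TemporaryDirectory() as tmp:
        sidecar = Path(tmp) / "sub-01_task-rest_bold.json"
        sidecar.write_text('{"RepetitionTime": 2.0}')
        vcs = FakeAnnex(missing={sidecar})
        update_json(sidecar, vcs, AnnexedMode.GET, TaskName="rest")
        return vcs.calls
```

In this sketch the unlock and add calls happen for every write, independent of `mode`, which mirrors the clarification that `--annexed` only governs missing content.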
209+
163210
## CLI Contract
164211

165-
All commands follow this pattern:
166-
- `--dry-run` / `-n`: Show what would change without modifying
212+
Group-level options (before the command):
213+
- `--annexed MODE`: How to handle git-annex files without local content. Modes: `error` (default), `get`, `skip-warning`, `skip`. Also settable via `BIDS_UTILS_ANNEXED` env var.
214+
215+
Per-command common options:
216+
- `--dry-run` / `-n`: Show what would change without modifying. Accepts optional value: `overview` (default, summary) or `detailed` (every file operation listed).
167217
- `--json`: Machine-readable JSON output
168218
- `-v` / `-q`: Verbosity control
169219
- `--force`: Skip confirmation on destructive operations
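
The optional-value behavior of `--dry-run` can be illustrated with the standard library. Note the substitution: the spec names Click as the CLI framework, while this sketch uses `argparse` only so that it runs without third-party dependencies. `build_parser` and `dry_run_mode` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a --dry-run flag that takes an optional value."""
    parser = argparse.ArgumentParser(prog="bids-utils-sketch")
    parser.add_argument(
        "--dry-run", "-n",
        nargs="?",            # the value may be omitted
        const="overview",     # flag given without a value
        default=None,         # flag absent: perform the real operation
        choices=["overview", "detailed"],
    )
    return parser


def dry_run_mode(argv: list):
    """Return None, "overview", or "detailed" for the given arguments."""
    return build_parser().parse_args(argv).dry_run
```

With this setup, `dry_run_mode(["-n"])` gives `"overview"` and `dry_run_mode(["--dry-run=detailed"])` gives `"detailed"`. Writing the value with `=` avoids ambiguity with a following positional argument.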
