Skip to content

Commit f3824b9

Browse files
yarikopticclaude
andcommitted
Fix symlink bug, add --dry-run=detailed, annex logging (Phase 1c: T092-T098)
BUG FIX: Replace is_file() with not is_dir() for file iteration in session.py, subject.py, run.py, split.py, merge.py, _sidecars.py, migrate.py. Path.is_file() follows symlinks and returns False for annexed files without content, silently skipping them during rename. ENHANCEMENT: --dry-run now accepts optional value: --dry-run (overview, default) shows summary; --dry-run=detailed lists every file operation. Session rename now enumerates per-file changes before the dry_run check so both modes have full information. Overview mode filters out indented detail lines; detailed mode shows "action: source → target" per file. ENHANCEMENT: Annex operations logged via Python logging — INFO for content fetches (--annexed=get), DEBUG for unlock/add. CLI verbosity (-v, -q) wired to logging levels. Add tmp_annex_dataset fixture (git-annex repo with locked symlinks). 6 regression tests verify all files (including symlinks) renamed correctly. 8 tests for --dry-run modes and annex log messages. 1297 tests pass across py310-py314, lint, type, duplication. Co-Authored-By: Claude Code 2.1.98 / Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: c963bb384041
1 parent e90e7ad commit f3824b9

24 files changed

Lines changed: 634 additions & 44 deletions

.specify/specs/00-initial-design.md

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -203,7 +203,7 @@ A dataset needs to be split — for example, extracting only behavioral data or
203203
- What happens when a rename creates a filename that exceeds OS path length limits?
204204
**Resolution**: Refuse with exit code 2 and a clear error. Covered by FR-011 (refuse invalid state). No extra task needed — implement as a guard in `rename_file()`.
205205
- How does the tool handle symlinked files (common with git-annex)?
206-
**Resolution**: `_vcs.py` GitAnnex backend handles this. Locked annexed files are symlinks; `git annex` commands operate on them correctly. Covered by T017-T018.
206+
**Resolution**: All file iteration code MUST treat symlinks as files (FR-023). `Path.is_file()` follows symlinks and returns `False` for annexed files without content — use `not path.is_dir()` instead. VCS operations (`git mv`, `git annex unlock/add`) handle symlinks correctly. Covered by T092.
207207
- What happens when `_scans.tsv` references files that don't exist on disk (dangling references)?
208208
**Resolution**: Warn but do not fail. Dangling references are a pre-existing dataset issue, not caused by bids-utils. Log at `-v` verbosity.
209209
- How does the tool handle partial datasets (e.g., missing `dataset_description.json`)?
@@ -241,13 +241,20 @@ A dataset needs to be split — for example, extracting only behavioral data or
241241
- Q: What about writing to annexed files? → A: Annexed files in locked mode (symlinks to `.git/annex/objects`) are read-only. Before modification, `unlock(paths)` must be called (`git annex unlock` / `datalad unlock`). After modification, `add(paths)` must be called (`git annex add`) to re-annex the file. The I/O layer provides `ensure_writable()` (unlock) and `mark_modified()` (add) to bracket writes. The full lifecycle for a modify operation on an annexed file is: get → unlock → read → modify → write → add.
242242
- Q: Should `unlock`/`add` be implicit or require `--annexed=get`? → A: `unlock` and `add` apply whenever the VCS is git-annex/DataLad, regardless of `--annexed` mode. The `--annexed` mode only controls what happens when content is *missing*. If content is present but the file is locked, any write operation must unlock first — this is a VCS-level concern, not a policy choice.
243243

244+
### Session 2026-04-10
245+
246+
- Q: Should `--dry-run` show every file operation or just a summary? → A: Both. `--dry-run` (no value or `--dry-run=overview`) shows the current summary view (one line per subject/session). `--dry-run=detailed` lists every individual file rename, file edit, and `_scans.tsv` update. The detailed mode is what users need to verify correctness before committing. The overview mode remains the default for quick checks.
247+
- Q: How should annexed content operations be logged? → A: When `--annexed=get` fetches content, log each file fetched at normal verbosity. In `--dry-run` mode, report which files *would* need content fetched. At `-v`, also log `unlock` and `add` operations.
248+
- **BUG**: `session.py` and `subject.py` use `Path.is_file()` to filter files for renaming, but `is_file()` follows symlinks — returning `False` for annexed files without local content (broken symlinks into `.git/annex/objects`). This means **annexed data files (`.nii.gz`, etc.) are silently skipped during rename**. The fix: use `not path.is_dir()` or `path.is_file() or path.is_symlink()` everywhere that iterates over files for processing. This affects `session.py`, `subject.py`, `run.py`, `split.py`, `merge.py`, `_sidecars.py`, and `migrate.py`. All existing tests missed this because they use `tmp_path` fixtures with real files, never symlinks.
249+
- Q: Why didn't the `bids-examples` integration tests catch the symlink bug? → A: `bids-examples` datasets contain regular files, not annexed symlinks. Integration tests need a fixture that creates a git-annex repo with locked (symlinked) files to exercise this path. Add a `tmp_annex_dataset` fixture.
250+
244251
## Requirements *(mandatory)*
245252

246253
### Functional Requirements
247254

248255
- **FR-001**: System MUST provide a Python library (`bids_utils`) with a clean, importable public API. Every CLI command maps to a library function.
249256
- **FR-002**: System MUST provide a CLI (`bids-utils`) as a thin wrapper over the library API.
250-
- **FR-003**: Every mutating command MUST support `--dry-run` / `-n` mode showing exactly what would change without modifying any files.
257+
- **FR-003**: Every mutating command MUST support `--dry-run` / `-n` mode showing exactly what would change without modifying any files. `--dry-run` (or `--dry-run=overview`) shows a summary view; `--dry-run=detailed` lists every individual file operation (rename, edit, content fetch). SC-002 applies to the detailed mode.
251258
- **FR-004**: System MUST detect and use VCS (git, git-annex, DataLad) when present — `git mv` instead of `os.rename`, etc. When no VCS is detected, operate directly on filesystem.
252259
- **FR-005**: System MUST update `_scans.tsv` entries whenever referenced files are renamed or removed.
253260
- **FR-006**: System MUST update `participants.tsv` when subjects are renamed or removed.
@@ -266,6 +273,8 @@ A dataset needs to be split — for example, extracting only behavioral data or
266273
- **FR-019**: System MUST provide a `bids-utils completion [SHELL]` subcommand that outputs shell completion activation scripts. When `SHELL` argument is omitted, auto-detect from the `$SHELL` environment variable. Supported shells: Bash, Zsh, Fish (matching Click 8.0+ built-in completion support). Output goes to stdout only (no `--install` flag).
267274
- **FR-020**: CLI MUST resolve the BIDS dataset root by: (1) using the `--dataset`/`-d` flag if provided, or (2) walking up the directory hierarchy from CWD until `dataset_description.json` is found. This resolution is used both by commands and by shell completion.
268275
- **FR-021**: Shell completion MUST provide BIDS-aware completions: filesystem-derived items (`sub-*` directories, `ses-*` directories, BIDS file paths) and entity keys from the `bidsschematools` schema (e.g., `task=`, `run=`, `acq=`). Entity value completion (e.g., `task=rest`) is deferred to a later release.
276+
- **FR-023**: All code that iterates over files MUST treat symlinks as files (not skip them). Use `not path.is_dir()` or `path.is_file() or path.is_symlink()` instead of bare `path.is_file()`. This is critical for git-annex datasets where data files are symlinks to `.git/annex/objects`.
277+
- **FR-024**: Annexed content operations (get, unlock, add) MUST be logged. At normal verbosity, log each file fetched by `--annexed=get`. In `--dry-run` mode, report files that would need content fetched. At `-v`, also log unlock/add operations. This gives users visibility into what the annex layer is doing.
269278
- **FR-022**: System MUST provide a group-level `--annexed` option controlling behavior when git-annex/DataLad file content is not locally available. Modes: `error` (default — informative error listing missing files and suggesting `--annexed=get` or `git annex get`), `get` (automatically fetch content via `git annex get` / `datalad get` before reading), `skip-warning` (skip files without content with a per-file warning), `skip` (skip silently). The option MUST also be settable via `BIDS_UTILS_ANNEXED` environment variable (CLI flag takes precedence). The VCS backend protocol MUST expose: `has_content(path)` and `get_content(paths)` for reads; `unlock(paths)` to make locked annexed files writable before modification; `add(paths)` to re-annex modified files after writes (restoring them to their original tracked state). All file reads (TSV, JSON sidecars) MUST go through a content-aware I/O layer. All file writes to potentially-annexed files MUST go through an unlock-before/add-after lifecycle managed by the I/O layer.
270279

271280
### Key Entities
@@ -282,7 +291,8 @@ A dataset needs to be split — for example, extracting only behavioral data or
282291
### Measurable Outcomes
283292

284293
- **SC-001**: Every bids-examples dataset that is valid before a `rename`/`subject-rename`/`session-rename` operation is still valid after the operation completes.
285-
- **SC-002**: `--dry-run` output for every command matches the actual changes when run without `--dry-run` (verified by comparing dry-run output to actual filesystem diff).
294+
- **SC-002**: `--dry-run=detailed` output for every command matches the actual changes when run without `--dry-run` (verified by comparing dry-run output to actual filesystem diff). `--dry-run=overview` provides a human-friendly summary.
295+
- **SC-008**: All file-renaming operations (session-rename, subject-rename, rename) correctly handle git-annex symlinks — verified by tests using a `tmp_annex_dataset` fixture with locked annexed files.
286296
- **SC-003**: All commands complete on a 1000-subject dataset in O(n) time relative to affected files (not O(n²) in total dataset size). Single-entity operations (rename, remove-run) must not scan the entire dataset. Benchmark target: `rename` on a single file in a 1000-subject dataset completes in under 5 seconds.
287297
- **SC-004**: Library API is independently usable: all acceptance scenarios can be executed via Python imports without the CLI.
288298
- **SC-005**: 100% of mutating commands have both `--dry-run` and `--json` modes tested in CI.

.specify/specs/00-initial-design/contracts/library-api.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -213,7 +213,7 @@ Group-level options (before the command):
213213
- `--annexed MODE`: How to handle git-annex files without local content. Modes: `error` (default), `get`, `skip-warning`, `skip`. Also settable via `BIDS_UTILS_ANNEXED` env var.
214214

215215
Per-command common options:
216-
- `--dry-run` / `-n`: Show what would change without modifying
216+
- `--dry-run` / `-n`: Show what would change without modifying. Accepts optional value: `overview` (default, summary) or `detailed` (every file operation listed).
217217
- `--json`: Machine-readable JSON output
218218
- `-v` / `-q`: Verbosity control
219219
- `--force`: Skip confirmation on destructive operations

.specify/specs/00-initial-design/plan.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -168,6 +168,20 @@ tests/
168168

169169
**Dependencies**: Phase 1 complete. Can be done at any point after Phase 1, but should be done before real-world usage on annexed datasets.
170170

171+
### Phase 1c: Symlink Safety & Dry-Run Detail (FR-003, FR-023, FR-024)
172+
173+
**Goal**: Fix the `is_file()` symlink bug that silently skips annexed data files during rename operations. Enhance `--dry-run` to show per-file detail. Add annex operation logging.
174+
175+
**Steps**:
176+
1. **Symlink bug fix (T092)**: Audit all `is_file()` calls used for file iteration. Replace with `not path.is_dir()` in `session.py`, `subject.py`, `run.py`, `split.py`, `merge.py`, `_sidecars.py`, `migrate.py`. Keep `is_file()` where checking for file existence (not iteration).
177+
2. **Annex test fixture (T093)**: `tmp_annex_dataset` in conftest.py — git-annex repo with locked symlinks alongside regular files.
178+
3. **Regression tests (T094)**: Session/subject/file rename on annexed dataset — verify all files including symlinks are renamed.
179+
4. **Dry-run detail (T095-T096)**: `--dry-run=overview|detailed`. Update `common_options`, ensure all library functions populate per-file `Change` entries. `output_result` renders overview vs detailed.
180+
5. **Annex logging (T097)**: INFO-level logging for get/unlock/add operations in `_io.py`.
181+
6. **Tests (T098)**: Verify `--dry-run=detailed` output.
182+
183+
**Dependencies**: Phase 1b complete. BLOCKS real-world usage on annexed datasets.
184+
171185
### Phase 2: File Rename (Story 1 — P1)
172186

173187
**Goal**: `bids-utils rename` working end-to-end with full test coverage.

.specify/specs/00-initial-design/tasks.md

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -302,6 +302,33 @@
302302

303303
---
304304

305+
## Phase 1c: Symlink Safety & Dry-Run Detail (FR-003, FR-023, FR-024)
306+
307+
**Purpose**: Fix critical git-annex symlink handling bug and enhance `--dry-run` to show per-file detail. These are blocking issues for real-world usage on annexed datasets.
308+
309+
### Bug fix: `is_file()` skips annexed symlinks (FR-023)
310+
311+
- [X] T092 Replace all bare `path.is_file()` calls used for file iteration with `not path.is_dir()` (or `path.is_file() or path.is_symlink()`) in: `session.py` (2 sites), `subject.py` (2 sites), `run.py` (2 sites), `split.py` (1 site), `merge.py` (1 site), `_sidecars.py` (1 site), `migrate.py` (1 site). Preserve `is_file()` where semantically correct (e.g., `_dataset.py` checking `dataset_description.json` existence, `_scans.py` checking `_scans.tsv` existence — these are never annexed).
312+
- [X] T093 Add `tmp_annex_dataset` pytest fixture in `tests/conftest.py`: creates a git-annex repo with locked (symlinked) data files (`.nii.gz`) alongside regular git files (`.json`, `.tsv`). Requires `git annex` to be installed (mark tests `skipif` otherwise).
313+
- [X] T094 Write regression tests using `tmp_annex_dataset` for session-rename, subject-rename, and rename — verify that ALL files (including annexed symlinks) are renamed correctly (SC-008). Test both with content present and content absent.
314+
315+
### Enhanced dry-run (FR-003 update)
316+
317+
- [X] T095 Change `--dry-run` / `-n` from a boolean flag to an optional-value option: `--dry-run` (or `--dry-run=overview`) for current summary behavior, `--dry-run=detailed` for per-file listing. Update `common_options` in `cli/_common.py`, `OperationResult`, and `output_result()`. Library functions already populate `result.changes` with per-file detail — the change is in how `output_result` renders them.
318+
- [X] T096 Ensure all library functions populate `result.changes` with per-file detail (not just one summary `Change` per subject/session). Audit `session.py`, `subject.py`, `rename.py` — the rename function already does this; session/subject need to add per-file `Change` entries for individual file renames within the session/subject operation.
319+
320+
### Annex operation logging (FR-024)
321+
322+
- [X] T097 Add logging to `_io.py` for annex operations: log at INFO level when `ensure_content` fetches a file (`--annexed=get`), when `ensure_writable` unlocks, when `mark_modified` re-adds. In `--dry-run` mode, report which files would need content fetched. Wire through to CLI verbosity (`-v` enables DEBUG, default shows INFO, `-q` suppresses).
323+
324+
### Tests
325+
326+
- [X] T098 Write tests for `--dry-run=detailed` output: verify per-file change listing for session-rename, subject-rename, rename. Verify `--dry-run=overview` retains current behavior. Verify `--dry-run` without value defaults to overview.
327+
328+
**Checkpoint**: `bids-utils --annexed=get session-rename --dry-run=detailed` shows every file that would be renamed/edited/fetched. Running without `--dry-run` on an annexed dataset correctly renames all files including symlinks.
329+
330+
---
331+
305332
## Phase 12: Polish & Cross-Cutting Concerns
306333

307334
**Purpose**: Improvements that affect multiple user stories.
@@ -323,6 +350,7 @@
323350
- **Phase 0 (Scaffolding)**: No dependencies — start immediately
324351
- **Phase 1 (Infrastructure)**: Depends on Phase 0 — BLOCKS all user stories
325352
- **Phase 1b (Annexed Content / FR-022)**: Depends on Phase 1. Can be done at any point but SHOULD be done before real-world usage on git-annex/DataLad datasets. Retroactively completes VCS integration from Phase 1.
353+
- **Phase 1c (Symlink Safety & Dry-Run Detail / FR-003, FR-023, FR-024)**: Depends on Phase 1b. BLOCKS real-world usage on annexed datasets — the symlink bug causes silent data loss (files not renamed). Should be done immediately after Phase 1b.
326354
- **Phase 2 (Rename / US1)**: Depends on Phase 1
327355
- **Phase 3 (Migrate 1.x / US2)**: Depends on Phase 2 (uses rename for suffix changes)
328356
- **Phase 4 (Migrate 2.0 / US3)**: Depends on Phase 3

src/bids_utils/_io.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
from __future__ import annotations
88

99
import json
10+
import logging
1011
import warnings
1112
from pathlib import Path
1213
from typing import TYPE_CHECKING, Any
@@ -16,6 +17,8 @@
1617
if TYPE_CHECKING:
1718
from bids_utils._vcs import VCSBackend
1819

20+
logger = logging.getLogger(__name__)
21+
1922

2023
def ensure_content(
2124
path: Path,
@@ -42,6 +45,7 @@ def ensure_content(
4245
return
4346

4447
if mode is AnnexedMode.GET:
48+
logger.info("Fetching annexed content: %s", path)
4549
vcs.get_content([path])
4650
return
4751

@@ -73,6 +77,7 @@ def ensure_writable(path: Path, vcs: VCSBackend) -> None:
7377
"""
7478
if path.is_symlink() and path.exists():
7579
# Locked annexed file with content present — unlock it
80+
logger.debug("Unlocking annexed file: %s", path)
7681
vcs.unlock([path])
7782

7883

@@ -84,6 +89,7 @@ def mark_modified(paths: list[Path], vcs: VCSBackend) -> None:
8489
(Git.add stages the file, NoVCS does nothing).
8590
"""
8691
if paths:
92+
logger.debug("Re-adding modified files: %s", [str(p) for p in paths])
8793
vcs.add(paths)
8894

8995

src/bids_utils/_sidecars.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -63,7 +63,7 @@ def find_sidecars(
6363
if sidecar_ext == ext:
6464
continue # Skip the primary file's own extension
6565
candidate = parent / f"{stem}{sidecar_ext}"
66-
if candidate.is_file():
66+
if candidate.exists() or candidate.is_symlink():
6767
sidecars.append(candidate)
6868

6969
return sidecars

src/bids_utils/cli/_common.py

Lines changed: 41 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44

55
import functools
66
import json
7+
import logging
78
import os
89
import sys
910
from collections.abc import Callable
@@ -23,8 +24,14 @@ def common_options(f: Callable[..., Any]) -> Callable[..., Any]:
2324
@click.option(
2425
"--dry-run",
2526
"-n",
26-
is_flag=True,
27-
help="Show what would change without modifying files.",
27+
is_flag=False,
28+
flag_value="overview",
29+
default=None,
30+
type=click.Choice(["overview", "detailed"]),
31+
help=(
32+
"Show what would change without modifying files. "
33+
"Use --dry-run=detailed for per-file listing."
34+
),
2835
)
2936
@click.option("--json", "json_output", is_flag=True, help="Output results as JSON.")
3037
@click.option("-v", "--verbose", count=True, help="Increase verbosity.")
@@ -37,6 +44,23 @@ def common_options(f: Callable[..., Any]) -> Callable[..., Any]:
3744
)
3845
@functools.wraps(f)
3946
def wrapper(**kwargs: Any) -> Any:
47+
# Configure logging from -v / -q
48+
# Default: INFO (shows annex get operations)
49+
# -v: DEBUG (shows unlock/add details)
50+
# -q: WARNING (suppresses info messages)
51+
verbose = kwargs.get("verbose", 0)
52+
quiet = kwargs.get("quiet", False)
53+
if quiet:
54+
level = logging.WARNING
55+
elif verbose:
56+
level = logging.DEBUG
57+
else:
58+
level = logging.INFO
59+
logging.basicConfig(
60+
level=level,
61+
format="%(message)s",
62+
force=True,
63+
)
4064
return f(**kwargs)
4165

4266
return wrapper
@@ -70,7 +94,7 @@ def load_dataset(path: Path | None = None) -> BIDSDataset:
7094
def output_result(
7195
result: OperationResult,
7296
json_output: bool,
73-
dry_run: bool,
97+
dry_run: str | None,
7498
*,
7599
exit_code: int = 2,
76100
) -> None:
@@ -83,16 +107,28 @@ def output_result(
83107
json_output
84108
If ``True``, emit a JSON document.
85109
dry_run
86-
Used for the ``[DRY RUN]`` prefix in text mode.
110+
``"overview"`` for summary, ``"detailed"`` for per-file listing,
111+
or ``None`` / falsy when not in dry-run mode.
87112
exit_code
88113
Exit code to use when ``result.success`` is ``False``.
89114
"""
90115
if json_output:
91116
click.echo(json.dumps(result.to_dict(), indent=2))
92117
else:
93118
prefix = "[DRY RUN] " if dry_run else ""
119+
detailed = dry_run == "detailed"
120+
94121
for change in result.changes:
95-
click.echo(f"{prefix}{change.detail}")
122+
if detailed:
123+
# Per-file: show action, source → target
124+
src = change.source
125+
tgt = f" → {change.target}" if change.target else ""
126+
click.echo(f"{prefix}{change.action}: {src}{tgt}")
127+
else:
128+
# Overview: skip indented detail lines (per-file items)
129+
if change.detail.startswith(" "):
130+
continue
131+
click.echo(f"{prefix}{change.detail}")
96132
for w in result.warnings:
97133
click.echo(f"Warning: {w}", err=True)
98134
for err in result.errors:

src/bids_utils/cli/merge.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ def merge(
2424
output: str,
2525
into_sessions: tuple[str, ...],
2626
on_conflict: str,
27-
dry_run: bool,
27+
dry_run: str | None,
2828
json_output: bool,
2929
verbose: int,
3030
quiet: bool,
@@ -39,7 +39,7 @@ def merge(
3939
output,
4040
into_sessions=sessions,
4141
on_conflict=on_conflict, # type: ignore[arg-type]
42-
dry_run=dry_run,
42+
dry_run=bool(dry_run),
4343
)
4444

4545
output_result(result, json_output, dry_run)

0 commit comments

Comments
 (0)