fix(directory-scanner): detect marimo notebooks with long module docstrings (#9647)#9652
Conversation
…trings (#9647) `is_marimo_app` only scanned the first 512 bytes for `import marimo` / `marimo.App`, so a module docstring large enough to push those markers past the window silently excluded the notebook from folder scans. Keep the 512-byte fast path, but fall back to a full-file scan (capped at 10 MB) on a miss — this also subsumes the prior `# /// script` special case. Also refactor the test module to use `tmp_path`, drop the `unittest` boilerplate, and harden `test_skip_common_directories` (the old assertion was tautological).
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
Fixes folder-scanning notebook detection when import marimo / marimo.App appear after the first 512 bytes (e.g., long module docstrings or large PEP 723 # /// script headers), so notebooks are no longer silently omitted from directory galleries.
Changes:
- Update
is_marimo_app()to keep a 512-byte fast path but fall back to a bounded slow-path scan (up to 10 MB) when markers aren’t found in the header. - Refactor directory-scanner tests to pytest +
tmp_path, add coverage for long-docstring and large script-header cases, and strengthen skip-directory assertions.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
marimo/_server/files/directory_scanner.py |
Adds slow-path full-file detection (size-capped) to avoid missing notebooks when markers occur after the header window. |
tests/_server/test_directory_scanner.py |
Rewrites tests to pytest style, adds regression tests for issue #9647, and improves scan/skip behavior assertions. |
| # Python: try the fast path first. | ||
| if contains_marimo_app(header): | ||
| return True | ||
|
|
||
| if b"# /// script" in header: | ||
| full_content = path.read_bytes() | ||
| if contains_marimo_app(full_content): | ||
| return True | ||
|
|
||
| return False | ||
| # Header didn't match — could be a long module docstring, a | ||
| # `# /// script` block, or simply not a marimo file. Fall back to | ||
| # a full-file scan, bounded by file size. | ||
| if len(header) < READ_LIMIT: | ||
| # Whole file already in `header`; markers absent. | ||
| return False | ||
| try: | ||
| size = os.path.getsize(full_path) | ||
| except OSError: | ||
| return False | ||
| if size > MAX_FILE_SIZE: | ||
| return False | ||
| return contains_marimo_app(path.read_bytes()) |
There was a problem hiding this comment.
1 issue found across 2 files
Architecture diagram
sequenceDiagram
participant User
participant API as Directory Scanner API
participant Scanner as DirectoryScanner
participant ScannerFS as Scanner (filesystem)
participant isMarimo as is_marimo_app()
participant FileSystem
participant Caller as PR Reviewer
Note over User,Caller: NEW: Directory scanning flow for marimo notebook detection
User->>API: POST /api/directory/scan
API->>Scanner: scan()
Scanner->>ScannerFS: collect files/dirs
ScannerFS->>ScannerFS: skip hidden, symlinks, common dirs
loop For each .py/.md/.qmd file
Scanner->>isMarimo: is_marimo_app(file_path)
isMarimo->>FileSystem: read first 512 bytes
FileSystem-->>isMarimo: header content
alt Markdown file (.md/.qmd)
isMarimo->>isMarimo: check "marimo-version:" in header
alt Found
isMarimo-->>Scanner: True (fast path)
else Not found
isMarimo-->>Scanner: False
end
else Python file (.py)
isMarimo->>isMarimo: check "marimo.App" AND "import marimo" in header
alt Found
isMarimo-->>Scanner: True (fast path)
else Not found (fast path miss)
Note over isMarimo: NEW: Fallback to full-file scan
isMarimo->>FileSystem: check file size
FileSystem-->>isMarimo: size
alt File size < 10 MB
isMarimo->>FileSystem: read entire file
FileSystem-->>isMarimo: full content
isMarimo->>isMarimo: scan full content for markers
alt Found
isMarimo-->>Scanner: True (slow path)
else Not found
isMarimo-->>Scanner: False
end
else File size >= 10 MB
isMarimo-->>Scanner: False (too large)
end
end
end
Scanner->>Scanner: collect result
end
Scanner-->>API: FileInfo list
API-->>User: directory listing with marimo annotations
Note over isMarimo: KEY IMPROVEMENTS
Note over isMarimo: - Handles long module docstrings (>512 bytes)
Note over isMarimo: - Handles # /// script headers pushed past 512 bytes
Note over isMarimo: - 10 MB cap prevents stalling on huge files
Note over isMarimo: - Replaces separate # /// script special case
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…ile cap Follow-up to the previous fix: - Apply the same fast-path + fallback to markdown so a long YAML frontmatter can't hide `marimo-version:` either. - Drop the `getsize` round-trip and second `read_bytes` — keep the file handle open and read the remainder in one pass. - Raise the slow-path cap from 10 MB to 50 MB; notebooks with embedded base64 data can plausibly exceed 10 MB. The scanner's 10s timeout is still the backstop for pathological inputs.
The cap bounds how far we read looking for markers, not the file size overall. Marimo notebooks put `import marimo` near the top regardless of how much embedded data follows, so 10 MB is already absurd headroom.
The cap bounds how far we scan looking for markers. Above `import marimo` there's only ever a shebang, comments, a docstring, and/or a `# /// script` block — none of which realistically exceed a few hundred KB. Also rename the constant to `MAX_SCAN_BYTES` so it isn't confused with a file-size limit.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
is_marimo_apponly scanned the first 512 bytes forimport marimo/marimo.App, so a module docstring large enough to push those markerspast the window silently excluded the notebook from folder scans. Keep
the 512-byte fast path, but fall back to a full-file scan (capped at
10 MB) on a miss — this also subsumes the prior
# /// scriptspecialcase.
Also refactor the test module to use
tmp_path, drop theunittestboilerplate, and harden
test_skip_common_directories(the oldassertion was tautological).
Fixes #9647