Skip to content

fix(directory-scanner): detect marimo notebooks with long module docstrings (#9647)#9652

Merged
mscolnick merged 4 commits into
mainfrom
ms/valencia-v3
May 21, 2026
Merged

fix(directory-scanner): detect marimo notebooks with long module docstrings (#9647)#9652
mscolnick merged 4 commits into
mainfrom
ms/valencia-v3

Conversation

@mscolnick
Copy link
Copy Markdown
Contributor

@mscolnick mscolnick commented May 21, 2026

is_marimo_app only scanned the first 512 bytes for import marimo /
marimo.App, so a module docstring large enough to push those markers
past the window silently excluded the notebook from folder scans. Keep
the 512-byte fast path, but fall back to a full-file scan (capped at
10 MB) on a miss — this also subsumes the prior # /// script special
case.

Also refactor the test module to use tmp_path, drop the unittest
boilerplate, and harden test_skip_common_directories (the old
assertion was tautological).

Fixes #9647

…trings (#9647)

`is_marimo_app` only scanned the first 512 bytes for `import marimo` /
`marimo.App`, so a module docstring large enough to push those markers
past the window silently excluded the notebook from folder scans. Keep
the 512-byte fast path, but fall back to a full-file scan (capped at
10 MB) on a miss — this also subsumes the prior `# /// script` special
case.

Also refactor the test module to use `tmp_path`, drop the `unittest`
boilerplate, and harden `test_skip_common_directories` (the old
assertion was tautological).
Copilot AI review requested due to automatic review settings May 21, 2026 17:37
@vercel
Copy link
Copy Markdown

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment May 21, 2026 6:43pm

Request Review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes folder-scanning notebook detection when import marimo / marimo.App appear after the first 512 bytes (e.g., long module docstrings or large PEP 723 # /// script headers), so notebooks are no longer silently omitted from directory galleries.

Changes:

  • Update is_marimo_app() to keep a 512-byte fast path but fall back to a bounded slow-path scan (up to 10 MB) when markers aren’t found in the header.
  • Refactor directory-scanner tests to pytest + tmp_path, add coverage for long-docstring and large script-header cases, and strengthen skip-directory assertions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
marimo/_server/files/directory_scanner.py Adds slow-path full-file detection (size-capped) to avoid missing notebooks when markers occur after the header window.
tests/_server/test_directory_scanner.py Rewrites tests to pytest style, adds regression tests for issue #9647, and improves scan/skip behavior assertions.

Comment on lines +53 to +69
# Python: try the fast path first.
if contains_marimo_app(header):
return True

if b"# /// script" in header:
full_content = path.read_bytes()
if contains_marimo_app(full_content):
return True

return False
# Header didn't match — could be a long module docstring, a
# `# /// script` block, or simply not a marimo file. Fall back to
# a full-file scan, bounded by file size.
if len(header) < READ_LIMIT:
# Whole file already in `header`; markers absent.
return False
try:
size = os.path.getsize(full_path)
except OSError:
return False
if size > MAX_FILE_SIZE:
return False
return contains_marimo_app(path.read_bytes())
Comment thread tests/_server/test_directory_scanner.py
@mscolnick mscolnick added the bug Something isn't working label May 21, 2026
akshayka
akshayka previously approved these changes May 21, 2026
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files

Architecture diagram
sequenceDiagram
    participant User
    participant API as Directory Scanner API
    participant Scanner as DirectoryScanner
    participant ScannerFS as Scanner (filesystem)
    participant isMarimo as is_marimo_app()
    participant FileSystem
    participant Caller as PR Reviewer

    Note over User,Caller: NEW: Directory scanning flow for marimo notebook detection

    User->>API: POST /api/directory/scan
    API->>Scanner: scan()
    Scanner->>ScannerFS: collect files/dirs
    ScannerFS->>ScannerFS: skip hidden, symlinks, common dirs
    
    loop For each .py/.md/.qmd file
        Scanner->>isMarimo: is_marimo_app(file_path)
        
        isMarimo->>FileSystem: read first 512 bytes
        FileSystem-->>isMarimo: header content
        
        alt Markdown file (.md/.qmd)
            isMarimo->>isMarimo: check "marimo-version:" in header
            alt Found
                isMarimo-->>Scanner: True (fast path)
            else Not found
                isMarimo-->>Scanner: False
            end
        else Python file (.py)
            isMarimo->>isMarimo: check "marimo.App" AND "import marimo" in header
            alt Found
                isMarimo-->>Scanner: True (fast path)
            else Not found (fast path miss)
                Note over isMarimo: NEW: Fallback to full-file scan
                isMarimo->>FileSystem: check file size
                FileSystem-->>isMarimo: size
                alt File size < 10 MB
                    isMarimo->>FileSystem: read entire file
                    FileSystem-->>isMarimo: full content
                    isMarimo->>isMarimo: scan full content for markers
                    alt Found
                        isMarimo-->>Scanner: True (slow path)
                    else Not found
                        isMarimo-->>Scanner: False
                    end
                else File size >= 10 MB
                    isMarimo-->>Scanner: False (too large)
                end
            end
        end
        
        Scanner->>Scanner: collect result
    end
    
    Scanner-->>API: FileInfo list
    API-->>User: directory listing with marimo annotations
    
    Note over isMarimo: KEY IMPROVEMENTS
    Note over isMarimo: - Handles long module docstrings (>512 bytes)
    Note over isMarimo: - Handles # /// script headers pushed past 512 bytes
    Note over isMarimo: - 10 MB cap prevents stalling on huge files
    Note over isMarimo: - Replaces separate # /// script special case
Loading

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread marimo/_server/files/directory_scanner.py Outdated
…ile cap

Follow-up to the previous fix:
- Apply the same fast-path + fallback to markdown so a long YAML
  frontmatter can't hide `marimo-version:` either.
- Drop the `getsize` round-trip and second `read_bytes` — keep the file
  handle open and read the remainder in one pass.
- Raise the slow-path cap from 10 MB to 50 MB; notebooks with embedded
  base64 data can plausibly exceed 10 MB. The scanner's 10s timeout is
  still the backstop for pathological inputs.
mscolnick added 2 commits May 21, 2026 14:03
The cap bounds how far we read looking for markers, not the file size
overall. Marimo notebooks put `import marimo` near the top regardless
of how much embedded data follows, so 10 MB is already absurd headroom.
The cap bounds how far we scan looking for markers. Above
`import marimo` there's only ever a shebang, comments, a docstring,
and/or a `# /// script` block — none of which realistically exceed a
few hundred KB. Also rename the constant to `MAX_SCAN_BYTES` so it
isn't confused with a file-size limit.
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread marimo/_server/files/directory_scanner.py
@mscolnick mscolnick merged commit 4a6fb52 into main May 21, 2026
42 of 43 checks passed
@mscolnick mscolnick deleted the ms/valencia-v3 branch May 21, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Notebooks silently excluded from folder gallery if module docstring pushes import marimo past first 512 bytes

3 participants