fix(crawler): switch to CommonMark-compliant markdown parser by larryro · Pull Request #751 · tale-project/tale

larryro · 2026-03-11T07:23:50Z

Summary

- Replace python-markdown with markdown-it-py (CommonMark-compliant) in the crawler service to fix headings being swallowed after tables and other parsing inconsistencies
- Remove the _normalize_markdown_headings workaround that was needed to patch python-markdown's non-standard behavior
- Trim leading/trailing whitespace from LLM text output in execute_agent_with_tools.ts to prevent leading-space headings from being misinterpreted

Test plan

- Replaced test_markdown_normalize.py with comprehensive test_markdown_to_html.py covering headings, tables, code fences, inline formatting, and realistic LLM output
- Run crawler test suite: `cd services/crawler && uv run pytest`
- Verify contract comparison workflow produces correct HTML headings

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Enhanced markdown rendering with improved support for standard CommonMark formatting, including better handling of headings, code blocks, and tables.
Bug Fixes
- Fixed output text trimming to properly remove leading and trailing whitespace from agent tool results.

Replace Python-Markdown with markdown-it-py for markdown-to-HTML conversion. Python-Markdown does not support leading spaces before ATX headings (e.g., ` # Heading`), which breaks DOCX report generation when LLM outputs start with a leading space — a common Claude API behavior. markdown-it-py is CommonMark-compliant and handles all edge cases natively, eliminating the need for the _normalize_markdown_headings workaround. Also trims LLM text output as defense-in-depth.
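The indentation rule behind this fix can be sketched in a few lines. CommonMark allows an ATX heading to be indented by up to three spaces, while four or more spaces start an indented code block. The snippet below is a simplified, standard-library-only illustration of that rule, not markdown-it-py's implementation; `ATX_HEADING` and `classify_line` are hypothetical names introduced here, and the classifier ignores context such as surrounding paragraphs or lists.

```python
import re

# Simplified CommonMark rule: 0-3 spaces of indentation, 1-6 '#' characters,
# then either whitespace plus content or the end of the line.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")


def classify_line(line: str) -> str:
    """Roughly classify one standalone line (hypothetical helper)."""
    match = ATX_HEADING.match(line)
    if match:
        # The heading level is the number of '#' characters.
        return f"h{len(match.group(1))}"
    if line.startswith("    ") and line.strip():
        # Four or more leading spaces: treated as an indented code block.
        return "indented-code"
    return "paragraph"


print(classify_line("# Heading"))      # h1
print(classify_line(" # Heading"))     # h1 (leading space is allowed)
print(classify_line("    # Heading"))  # indented-code
print(classify_line("#Heading"))       # paragraph (no space after '#')
```

The second case is the one described above: an LLM response that starts with a single space still yields a heading under CommonMark rules.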

coderabbitai · 2026-03-11T07:31:18Z

Walkthrough

The pull request replaces the markdown processing library in the crawler service from Python-Markdown to markdown-it-py (CommonMark), removing custom ATX heading normalization preprocessing. The dependency is updated accordingly in the project configuration. Tests for the removed normalization function are deleted, and new comprehensive tests are added for the refactored markdown-to-html functionality. Additionally, a whitespace-trimming adjustment is made to text output handling in the LLM execution helper in the platform service.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: replacing the markdown parser from python-markdown to a CommonMark-compliant one (markdown-it-py) in the crawler service. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/pyproject.toml`:
- Line 16: The dependency pin "markdown-it-py>=3.0.0" is too loose and may allow
v4.x which breaks tests; change that requirement to a tested, exact or narrow
range (e.g., pin to a specific v3.x version or use ==3.x.y or ~=3.0) so the test
expectations in test_markdown_to_html.py remain stable—update the
"markdown-it-py>=3.0.0" entry in pyproject.toml to the validated version string
consistent with other exact-pinned converters.

In
`@services/platform/convex/workflow_engine/helpers/nodes/llm/execute_agent_with_tools.ts`:
- Around line 366-368: The helper currently trims the LLM result text causing
loss of meaningful leading/trailing whitespace; update the logic around
executeTextOutput so the variable outputText preserves the raw string from
isRecord(result) && typeof result.text === 'string' ? result.text : '' without
calling .trim(), and only perform a .trim() on a separate check (e.g., to
determine if the output is empty) rather than mutating the returned value;
ensure any downstream return or output uses the untrimmed outputText so code
blocks and indentation remain intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b276e57-f251-4670-a471-7da7e4f930bf

📥 Commits

Reviewing files that changed from the base of the PR and between a2e85cd and b575cfb.

⛔ Files ignored due to path filters (1)

services/crawler/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

services/crawler/app/services/base_converter.py
services/crawler/pyproject.toml
services/crawler/tests/test_markdown_normalize.py
services/crawler/tests/test_markdown_to_html.py
services/platform/convex/workflow_engine/helpers/nodes/llm/execute_agent_with_tools.ts

💤 Files with no reviewable changes (1)

services/crawler/tests/test_markdown_normalize.py

coderabbitai · 2026-03-11T07:31:21Z

    "loguru==0.7.3",
    "httpx==0.28.1",
-    "markdown==3.10.2",
+    "markdown-it-py>=3.0.0",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n services/crawler/pyproject.toml

Repository: tale-project/tale

Length of output: 1769

🏁 Script executed:

find services/crawler/tests -type f -name "*.py" | head -20

Repository: tale-project/tale

Length of output: 959

🏁 Script executed:

cat -n services/crawler/tests/test_markdown_to_html.py

Repository: tale-project/tale

Length of output: 6074

🏁 Script executed:

rg "markdown_to_html" services/crawler/app -A 5 -B 2

Repository: tale-project/tale

Length of output: 2770

🏁 Script executed:

cat -n services/crawler/app/services/base_converter.py | head -50

Repository: tale-project/tale

Length of output: 2238

🏁 Script executed:

web_search markdown-it-py version differences HTML output rendering

Repository: tale-project/tale

Length of output: 164

🌐 Web query:

markdown-it-py version 3.0 vs 4.0 HTML rendering differences breaking changes

💡 Result:

Between markdown-it-py 3.0.0 (2023-06-03) and 4.0.0 (2025-08-11) there isn’t a headline “HTML renderer rewrite” style breaking change; most differences that can change rendered HTML are edge-case parsing fixes and spec-alignment tweaks.

HTML output differences you’re most likely to notice

Tables: fewer “auto-completed” empty cells

v4.0.0 includes a change to limit the number of autocompleted cells in a table, matching an upstream markdown-it change. If your Markdown has irregular row lengths (e.g., missing trailing | cells), v3 could pad out lots of empty <td> cells; v4 will cap that padding, so the emitted <table>...</table> HTML can differ (often smaller/less surprising). [1], [2]

Raw-link parsing + emphasis interaction

A bug fix changes how the parser detects “raw links” in text when they contain characters that can look like emphasis markers (e.g. _), by changing the matching behavior so URLs are recognized as intended. This can change whether you get:

one <a href="...">...</a> link, vs.

partially emphasized text (<em>...</em>) / broken link boundaries. [1], [3]

Non-HTML-but-breaking upgrade note (often hits CI/builds)

v4 requires Python >= 3.10, so upgrading from v3 on older runtimes is a hard break (even if your Markdown output would otherwise be identical). [4]

Sources

[1] markdown-it-py v4.0.0 release notes (GitHub releases): https://github.com/executablebooks/markdown-it-py/releases
[2] executablebooks/markdown-it-py#364, "limit the number of autocompleted cells in a table"
[3] executablebooks/markdown-it-py#320, "Fix emphasis inside raw links bugs"
[4] PyPI project page (shows v4.0.0 and "Requires: Python >=3.10"): https://pypi.org/project/markdown-it-py/

Pin markdown-it-py to a tested version range.

Tests in test_markdown_to_html.py depend on specific HTML output semantics, particularly table rendering and heading-after-table parsing. Between markdown-it-py v3.0.0 and v4.0.0, the release notes document changes to table cell auto-completion and raw-link emphasis parsing that can alter the emitted HTML in edge cases. A bare `>=3.0.0` allows v4.x to be installed, which could invalidate assertions such as `assert html.count("<h1>") == 2` and the table structure checks. This weakens reproducibility for the contract comparison report tests and departs from the file's pattern of exact-pinning converter dependencies.

♻️ Proposed change

-    "markdown-it-py>=3.0.0",
+    "markdown-it-py>=3.0.0,<4.0.0",

Alternatively, match the exact-pin style of similar converter dependencies (beautifulsoup4==4.14.3, python-pptx==1.0.2, etc.) if v3 has been validated.
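To make the difference between the two specifiers concrete, here is a minimal sketch that checks plain dotted version strings against a lower bound and an optional exclusive upper bound. It is an illustration only: real installers apply the full PEP 440 rules (pre-releases, post-releases, and so on), and `parse_version` and `satisfies` are hypothetical helpers, not part of any packaging tool.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '3.0.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, lower: str, upper: Optional[str] = None) -> bool:
    """Check lower <= version, and version < upper when an upper bound is given."""
    parsed = parse_version(version)
    if parsed < parse_version(lower):
        return False
    return upper is None or parsed < parse_version(upper)


# ">=3.0.0" alone admits the next major release.
print(satisfies("4.0.0", "3.0.0"))           # True
# ">=3.0.0,<4.0.0" keeps resolution within the 3.x series.
print(satisfies("4.0.0", "3.0.0", "4.0.0"))  # False
print(satisfies("3.0.0", "3.0.0", "4.0.0"))  # True
```

An exact pin (`==`) narrows this further to a single version, which is the style the other converter dependencies in the file use.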

coderabbitai · 2026-03-11T07:31:22Z

+  const outputText = (
+    isRecord(result) && typeof result.text === 'string' ? result.text : ''
+  ).trim();


⚠️ Potential issue | 🟠 Major

Don’t trim the returned text in this shared helper.

executeTextOutput is the generic text path, so .trim() changes legitimate outputs: leading indentation, trailing newlines, and any markdown/code block that depends on exact whitespace. It also conflicts with the CommonMark behavior you codified in services/crawler/tests/test_markdown_to_html.py, Lines 33-36, where four leading spaces must remain an indented code block. Keep the raw text for the return value and use trimming only for the empty-output check.

🐛 Proposed fix

-  const outputText = (
-    isRecord(result) && typeof result.text === 'string' ? result.text : ''
-  ).trim();
+  const outputText =
+    isRecord(result) && typeof result.text === 'string' ? result.text : '';

-  if (!outputText || !outputText.trim()) {
+  if (!outputText.trim()) {
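The same pattern, sketched as a Python analogue with `str.strip()` standing in for JavaScript's `trim()`: whitespace is stripped only to decide whether the output is empty, and the raw string is what gets returned. `extract_output_text` is a hypothetical name, and returning `None` for empty output is an assumption made for this sketch; the actual helper's handling of empty output is not shown in this review.

```python
from typing import Optional


def extract_output_text(result: object) -> Optional[str]:
    """Return the raw text, or None when it is missing or whitespace-only."""
    text = result.get("text") if isinstance(result, dict) else None
    # Strip only for the emptiness check; never mutate the returned value.
    if not isinstance(text, str) or not text.strip():
        return None
    return text


raw = "    indented code\n"
print(repr(extract_output_text({"text": raw})))     # '    indented code\n'
print(repr(extract_output_text({"text": "  \n"})))  # None
print(repr(raw.strip()))                            # 'indented code'
```

The last line shows what trimming the return value would lose: the four leading spaces that mark an indented code block, along with the trailing newline.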

greptile-apps bot reviewed Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

larryro merged commit 4671d61 into main Mar 11, 2026
16 checks passed

larryro deleted the fix/markdown-parser-commonmark-compliance branch March 11, 2026 08:10

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

fix(crawler): switch to CommonMark-compliant markdown parser (#751)

b923bce
