feat(gdocs): extract smart chips (date, person, rich link) when reading documents#635
feat(gdocs): extract smart chips (date, person, rich link) when reading documents#635argo49 wants to merge 2 commits into
Conversation
…ng documents Date chips, person chips, and rich link chips were silently dropped when reading Google Docs because only textRun elements were handled. Add support for dateElement, person, and richLink paragraph elements across markdown conversion, plain text extraction, and structure parsing. Consolidate duplicated paragraph extraction logic into a shared extract_paragraph_text() function.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds support for Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gdocs/docs_structure.py (1)
202-216:⚠️ Potential issue | 🔴 CriticalCritical:
_extract_paragraph_textis undefined — function was renamed but this call site was not updated.Line 208 calls
_extract_paragraph_text()which no longer exists. The function was renamed toextract_paragraph_text()(public, without underscore prefix) but this call site in_parse_segment()was not updated. This will raise aNameErrorat runtime when parsing document headers or footers.🐛 Proposed fix
def _parse_segment(segment_data: dict[str, Any]) -> dict[str, Any]: """Parse a document segment (header/footer).""" content = segment_data.get("content", []) text_parts = [] for element in content: if "paragraph" in element: - text_parts.append(_extract_paragraph_text(element["paragraph"])) + text_parts.append(extract_paragraph_text(element["paragraph"])) return { "content": content, "start_index": content[0].get("startIndex", 0) if content else 0, "end_index": content[-1].get("endIndex", 0) if content else 0, "text_preview": "".join(text_parts)[:100], "element_count": len(content), }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gdocs/docs_structure.py` around lines 202 - 216, The call in _parse_segment still uses the old private name _extract_paragraph_text which no longer exists; update the call to use the renamed public function extract_paragraph_text (i.e., replace _extract_paragraph_text(...) with extract_paragraph_text(...)) and ensure the function is imported or defined in the same module so no NameError occurs at runtime.
🧹 Nitpick comments (1)
tests/gdocs/test_docs_markdown.py (1)
734-748: Consider movingDATE_ELEMENT_DOCfixture beforeTestStructureWithChipsfor clarity.The
DATE_ELEMENT_DOCfixture (line 753) is defined after theTestStructureWithChipsclass that references it (line 746). While Python resolves names at call time so this works correctly, placing fixtures before their usage improves readability and follows the pattern established by other fixtures in this file.Also applies to: 751-781
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdocs/test_docs_markdown.py` around lines 734 - 748, The TestStructureWithChips class references the DATE_ELEMENT_DOC fixture which is defined later in the file; move the DATE_ELEMENT_DOC fixture definition so it appears before the TestStructureWithChips class (and similarly reorder any other fixtures in the 751-781 range that are used by tests) to improve readability and follow the file's fixture-before-usage pattern — update the file so DATE_ELEMENT_DOC (and related fixtures) are declared prior to TestStructureWithChips while keeping names and contents unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@gdocs/docs_structure.py`:
- Around line 202-216: The call in _parse_segment still uses the old private
name _extract_paragraph_text which no longer exists; update the call to use the
renamed public function extract_paragraph_text (i.e., replace
_extract_paragraph_text(...) with extract_paragraph_text(...)) and ensure the
function is imported or defined in the same module so no NameError occurs at
runtime.
---
Nitpick comments:
In `@tests/gdocs/test_docs_markdown.py`:
- Around line 734-748: The TestStructureWithChips class references the
DATE_ELEMENT_DOC fixture which is defined later in the file; move the
DATE_ELEMENT_DOC fixture definition so it appears before the
TestStructureWithChips class (and similarly reorder any other fixtures in the
751-781 range that are used by tests) to improve readability and follow the
file's fixture-before-usage pattern — update the file so DATE_ELEMENT_DOC (and
related fixtures) are declared prior to TestStructureWithChips while keeping
names and contents unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4903999c-1651-400b-9f0c-55e720cc92d5
📒 Files selected for processing (4)
gdocs/docs_markdown.pygdocs/docs_structure.pygdocs/docs_tools.pytests/gdocs/test_docs_markdown.py
|
I think we may have resolved this in #649 - can you confirm? |
Description
Date chips, person chips, and rich link chips were silently dropped when reading Google Docs because only textRun elements were handled. Add support for dateElement, person, and richLink paragraph elements across markdown conversion, plain text extraction, and structure parsing. Consolidate duplicated paragraph extraction logic into a shared extract_paragraph_text() function.
Type of Change
Testing
Checklist
Additional Notes
Add any other context about the pull request here.
To enable this setting:
Summary by CodeRabbit
New Features
Tests
Chores