Skip to content

feat(gdocs): extract smart chips (date, person, rich link) when reading documents#635

Closed
argo49 wants to merge 2 commits into
taylorwilsdon:mainfrom
argo49:feat/chip-extraction
Closed

feat(gdocs): extract smart chips (date, person, rich link) when reading documents#635
argo49 wants to merge 2 commits into
taylorwilsdon:mainfrom
argo49:feat/chip-extraction

Conversation

@argo49
Copy link
Copy Markdown

@argo49 argo49 commented Mar 30, 2026

Description

Date chips, person chips, and rich link chips were silently dropped when reading Google Docs because only textRun elements were handled. Add support for dateElement, person, and richLink paragraph elements across markdown conversion, plain text extraction, and structure parsing. Consolidate duplicated paragraph extraction logic into a shared extract_paragraph_text() function.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • [ x ] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Testing

  • [ x ] I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • [ x ] I have tested this change manually

Checklist

  • My code follows the style guidelines of this project
  • [ x ] I have performed a self-review of my own code
  • [ x ] I have commented my code, particularly in hard-to-understand areas
  • [ x ] My changes generate no new warnings
  • [ x ] I have enabled "Allow edits from maintainers" for this pull request

Additional Notes

Add any other context about the pull request here.

  • Tests were failing on a clean main branch run before my changes.

⚠️ IMPORTANT: This repository requires that you enable "Allow edits from maintainers" when creating your pull request. This allows maintainers to make small fixes and improvements directly to your branch, speeding up the review process.

To enable this setting:

  1. When creating the PR, check the "Allow edits from maintainers" checkbox
  2. If you've already created the PR, you can enable this in the PR sidebar under "Allow edits from maintainers"

Summary by CodeRabbit

  • New Features

    • Google Docs rich links, person references, and date elements are now recognized and converted into appropriate Markdown/text representations within paragraphs and table cells.
  • Tests

    • Added comprehensive tests covering rich links, person chips, date elements, mixed paragraphs, and table cell handling to validate Markdown output and structure extraction.
  • Chores

    • Paragraph text extraction was made a public/shared utility (resulting in an API surface change).

…ng documents

Date chips, person chips, and rich link chips were silently dropped when
reading Google Docs because only textRun elements were handled. Add
support for dateElement, person, and richLink paragraph elements across
markdown conversion, plain text extraction, and structure parsing.
Consolidate duplicated paragraph extraction logic into a shared
extract_paragraph_text() function.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fac27a90-ca3f-4d9b-97ef-00c287dd9a5c

📥 Commits

Reviewing files that changed from the base of the PR and between 1427764 and 5200d31.

📒 Files selected for processing (1)
  • gdocs/docs_structure.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • gdocs/docs_structure.py

📝 Walkthrough

Walkthrough

This PR adds support for richLink, person, and dateElement inline elements to paragraph extraction and Markdown conversion, refactors the paragraph extractor into a public extract_paragraph_text function, updates call sites, and expands tests covering these smart-chip scenarios.

Changes

Cohort / File(s) Summary
Core element handling
gdocs/docs_markdown.py, gdocs/docs_structure.py
Add handling for richLink → Markdown link (title/uri fallback), personname (email) or fallback, and dateElementdisplayText.
Paragraph extractor refactor
gdocs/docs_structure.py, gdocs/docs_tools.py
Rename _extract_paragraph_textextract_paragraph_text (public) and replace inline paragraph-element logic with calls to the new extractor.
Tests & fixtures
tests/gdocs/test_docs_markdown.py
Add fixtures and tests for richLink variants, person chips, dateElement cases, mixed paragraphs, and table-cell chip handling; assert both Markdown output and parsed structure text.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

enhancement

Poem

🐰 I nibble through docs with a twitchy nose,

richLinks gleam where the link-text grows,
people appear as “Name (email)” bright,
dates hop in, displayed just right—
a carrot of markdown tucked snug for the night.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and specifically summarizes the main change: adding extraction support for smart chips (date, person, and rich link elements) in Google Docs document reading.
Description check ✅ Passed The pull request description follows the required template structure with all main sections completed: clear description of changes, type of change selected (new feature), testing details provided, and most checklist items checked including enabling maintainer edits.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gdocs/docs_structure.py (1)

202-216: ⚠️ Potential issue | 🔴 Critical

Critical: _extract_paragraph_text is undefined — function was renamed but this call site was not updated.

Line 208 calls _extract_paragraph_text() which no longer exists. The function was renamed to extract_paragraph_text() (public, without underscore prefix) but this call site in _parse_segment() was not updated. This will raise a NameError at runtime when parsing document headers or footers.

🐛 Proposed fix
 def _parse_segment(segment_data: dict[str, Any]) -> dict[str, Any]:
     """Parse a document segment (header/footer)."""
     content = segment_data.get("content", [])
     text_parts = []
     for element in content:
         if "paragraph" in element:
-            text_parts.append(_extract_paragraph_text(element["paragraph"]))
+            text_parts.append(extract_paragraph_text(element["paragraph"]))
 
     return {
         "content": content,
         "start_index": content[0].get("startIndex", 0) if content else 0,
         "end_index": content[-1].get("endIndex", 0) if content else 0,
         "text_preview": "".join(text_parts)[:100],
         "element_count": len(content),
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gdocs/docs_structure.py` around lines 202 - 216, The call in _parse_segment
still uses the old private name _extract_paragraph_text which no longer exists;
update the call to use the renamed public function extract_paragraph_text (i.e.,
replace _extract_paragraph_text(...) with extract_paragraph_text(...)) and
ensure the function is imported or defined in the same module so no NameError
occurs at runtime.
🧹 Nitpick comments (1)
tests/gdocs/test_docs_markdown.py (1)

734-748: Consider moving DATE_ELEMENT_DOC fixture before TestStructureWithChips for clarity.

The DATE_ELEMENT_DOC fixture (line 753) is defined after the TestStructureWithChips class that references it (line 746). While Python resolves names at call time so this works correctly, placing fixtures before their usage improves readability and follows the pattern established by other fixtures in this file.

Also applies to: 751-781

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdocs/test_docs_markdown.py` around lines 734 - 748, The
TestStructureWithChips class references the DATE_ELEMENT_DOC fixture which is
defined later in the file; move the DATE_ELEMENT_DOC fixture definition so it
appears before the TestStructureWithChips class (and similarly reorder any other
fixtures in the 751-781 range that are used by tests) to improve readability and
follow the file's fixture-before-usage pattern — update the file so
DATE_ELEMENT_DOC (and related fixtures) are declared prior to
TestStructureWithChips while keeping names and contents unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@gdocs/docs_structure.py`:
- Around line 202-216: The call in _parse_segment still uses the old private
name _extract_paragraph_text which no longer exists; update the call to use the
renamed public function extract_paragraph_text (i.e., replace
_extract_paragraph_text(...) with extract_paragraph_text(...)) and ensure the
function is imported or defined in the same module so no NameError occurs at
runtime.

---

Nitpick comments:
In `@tests/gdocs/test_docs_markdown.py`:
- Around line 734-748: The TestStructureWithChips class references the
DATE_ELEMENT_DOC fixture which is defined later in the file; move the
DATE_ELEMENT_DOC fixture definition so it appears before the
TestStructureWithChips class (and similarly reorder any other fixtures in the
751-781 range that are used by tests) to improve readability and follow the
file's fixture-before-usage pattern — update the file so DATE_ELEMENT_DOC (and
related fixtures) are declared prior to TestStructureWithChips while keeping
names and contents unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4903999c-1651-400b-9f0c-55e720cc92d5

📥 Commits

Reviewing files that changed from the base of the PR and between 45a5f93 and 1427764.

📒 Files selected for processing (4)
  • gdocs/docs_markdown.py
  • gdocs/docs_structure.py
  • gdocs/docs_tools.py
  • tests/gdocs/test_docs_markdown.py

@taylorwilsdon taylorwilsdon self-assigned this Apr 1, 2026
@taylorwilsdon taylorwilsdon added the enhancement New feature or request label Apr 1, 2026
@taylorwilsdon
Copy link
Copy Markdown
Owner

I think we may have resolved this in #649 - can you confirm?

@argo49 argo49 closed this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants