Skip to content

Latest commit

 

History

History
144 lines (107 loc) · 5.29 KB

File metadata and controls

144 lines (107 loc) · 5.29 KB

Google Docs API Unicode Characters Guide

This guide addresses the issue reported in #2547 where Google Docs API returns date elements and other smart chips as Unicode placeholder characters.

The Issue

When using the Google Docs API to retrieve document content, date elements, smart chips, and other non-text elements are returned as Unicode Private Use Area characters, specifically \ue907 (U+E907).

Example from Issue #2547

# This is what you see in the API response:
{'startIndex': 1, 'endIndex': 5, 'textRun': {'content': '\ue907 | ', 'textStyle': {}}}

Instead of getting readable text like "Jan 13, 2025", you get the Unicode character \ue907.

Why This Happens

This is intentional behavior by the Google Docs API, not a bug in the client library. According to the official API documentation:

TextRun.content: "The text of this run. Any non-text elements in the run are replaced with the Unicode character U+E907."

Google Docs uses Private Use Area Unicode characters to represent:

  • Smart chips (dates, people, places)
  • Code blocks
  • Special embedded elements
  • Rich content that isn't plain text

Solution: Extract Data from Rich Links

The actual date/chip information is typically available in richLink properties within the same document elements.

Quick Start

from googleapiclient.discovery import build

# Unicode placeholder character used by Google Docs API  
SMART_CHIP_PLACEHOLDER = '\ue907'

# When processing document elements
for element in elements:
    if 'textRun' in element:
        content = element['textRun']['content']
        
        # Check for smart chip placeholder
        if SMART_CHIP_PLACEHOLDER in content:
            print("Found smart chip/date placeholder!")
            
            # Make readable for display
            readable = content.replace(SMART_CHIP_PLACEHOLDER, '[DATE/CHIP]')
            print(f"Display version: {readable}")
    
    # IMPORTANT: Look for actual data in rich links
    if 'richLink' in element:
        props = element['richLink'].get('richLinkProperties', {})
        actual_date = props.get('title', '')  # Real date is here!
        calendar_uri = props.get('uri', '')
        print(f"Actual date: {actual_date}")

Key Functions You'll Need

Basic Unicode Detection

def has_smart_chips(text):
    """Check if text contains Google Docs smart chip placeholders."""
    return '\ue907' in text

def make_readable(text):
    """Replace Unicode placeholders with readable text."""
    return text.replace('\ue907', '[DATE/CHIP]')

Extracting Actual Date Information

While the text content shows \ue907, you can find the actual date/chip data in:

1. Rich Link Properties (Primary Solution)

# Look for richLink elements with the actual data
for element in elements:
    if 'richLink' in element:
        rich_link = element['richLink']
        if 'richLinkProperties' in rich_link:
            props = rich_link['richLinkProperties']
            actual_date = props.get('title', '')    # "Jan 13, 2025" 
            calendar_uri = props.get('uri', '')     # Calendar event link
            
            print(f"Found actual date: {actual_date}")
            if 'calendar.google.com' in calendar_uri:
                print("This is a Google Calendar event!")

2. Context Analysis

Analyze surrounding elements for clues:

def find_date_context(elements, chip_index):
    """Look at adjacent elements for date context."""
    before = elements[chip_index - 1] if chip_index > 0 else None
    after = elements[chip_index + 1] if chip_index < len(elements) - 1 else None
    # Examine textRun content in adjacent elements

3. Google Calendar API Integration

For calendar events, use the Calendar API:

from googleapiclient.discovery import build

def get_event_details(calendar_uri, credentials):
    """Extract full event details from calendar URI."""
    calendar_service = build('calendar', 'v3', credentials=credentials)
    # Parse event ID from URI and fetch complete event data

Complete Example

See examples/docs_unicode_dates_example.py for a working example that demonstrates:

  • Detecting Unicode placeholder characters
  • Extracting actual dates from rich link properties
  • Analyzing document structure
  • Best practices for handling smart chips

Best Practices

  1. Always check for rich links - They often contain the actual data
  2. Use context analysis - Surrounding text can provide clues
  3. Don't rely solely on textRun content - It's designed to be a fallback
  4. Consider the document structure - Smart chips are part of larger semantic elements
  5. Use appropriate APIs - For calendar events, use Calendar API; for contacts, use People API

Important Notes

  • This is not a bug - it's intentional API design
  • The client library correctly returns what the API provides
  • Focus on extracting semantic meaning rather than display text
  • Different smart chip types may require different extraction strategies

References