Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
144 changes: 144 additions & 0 deletions docs/docs-unicode-guide.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Google Docs API Unicode Characters Guide

This guide addresses the issue reported in [#2547](https://github.com/googleapis/google-api-python-client/issues/2547) where Google Docs API returns date elements and other smart chips as Unicode placeholder characters.

## The Issue

When using the Google Docs API to retrieve document content, date elements, smart chips, and other non-text elements are returned as Unicode Private Use Area characters, specifically `\ue907` (U+E907).

### Example from Issue #2547

```python
# This is what you see in the API response:
{'startIndex': 1, 'endIndex': 5, 'textRun': {'content': '\ue907 | ', 'textStyle': {}}}
```

Instead of getting readable text like "Jan 13, 2025", you get the Unicode character `\ue907`.

## Why This Happens

This is **intentional behavior** by the Google Docs API, not a bug in the client library. According to the official API documentation:

> **TextRun.content**: "The text of this run. Any non-text elements in the run are replaced with the Unicode character U+E907."

Google Docs uses Private Use Area Unicode characters to represent:
- Smart chips (dates, people, places)
- Code blocks
- Special embedded elements
- Rich content that isn't plain text

## Solution: Extract Data from Rich Links

The actual date/chip information is typically available in `richLink` properties within the same document elements.

### Quick Start

```python
from googleapiclient.discovery import build

# Unicode placeholder character used by Google Docs API
SMART_CHIP_PLACEHOLDER = '\ue907'

# When processing document elements
for element in elements:
if 'textRun' in element:
content = element['textRun']['content']

# Check for smart chip placeholder
if SMART_CHIP_PLACEHOLDER in content:
print("Found smart chip/date placeholder!")

# Make readable for display
readable = content.replace(SMART_CHIP_PLACEHOLDER, '[DATE/CHIP]')
print(f"Display version: {readable}")

# IMPORTANT: Look for actual data in rich links
if 'richLink' in element:
props = element['richLink'].get('richLinkProperties', {})
actual_date = props.get('title', '') # Real date is here!
calendar_uri = props.get('uri', '')
print(f"Actual date: {actual_date}")
```

### Key Functions You'll Need

#### Basic Unicode Detection
```python
def has_smart_chips(text):
"""Check if text contains Google Docs smart chip placeholders."""
return '\ue907' in text

def make_readable(text):
"""Replace Unicode placeholders with readable text."""
return text.replace('\ue907', '[DATE/CHIP]')
```

## Extracting Actual Date Information

While the text content shows `\ue907`, you can find the actual date/chip data in:

### 1. Rich Link Properties (Primary Solution)
```python
# Look for richLink elements with the actual data
for element in elements:
if 'richLink' in element:
rich_link = element['richLink']
if 'richLinkProperties' in rich_link:
props = rich_link['richLinkProperties']
actual_date = props.get('title', '') # "Jan 13, 2025"
calendar_uri = props.get('uri', '') # Calendar event link

print(f"Found actual date: {actual_date}")
if 'calendar.google.com' in calendar_uri:
print("This is a Google Calendar event!")
```

### 2. Context Analysis
Analyze surrounding elements for clues:
```python
def find_date_context(elements, chip_index):
"""Look at adjacent elements for date context."""
before = elements[chip_index - 1] if chip_index > 0 else None
after = elements[chip_index + 1] if chip_index < len(elements) - 1 else None
# Examine textRun content in adjacent elements
```

### 3. Google Calendar API Integration
For calendar events, use the Calendar API:
```python
from googleapiclient.discovery import build

def get_event_details(calendar_uri, credentials):
"""Extract full event details from calendar URI."""
calendar_service = build('calendar', 'v3', credentials=credentials)
# Parse event ID from URI and fetch complete event data
```

## Complete Example

See `examples/docs_unicode_dates_example.py` for a working example that demonstrates:
- Detecting Unicode placeholder characters
- Extracting actual dates from rich link properties
- Analyzing document structure
- Best practices for handling smart chips

## Best Practices

1. **Always check for rich links** - They often contain the actual data
2. **Use context analysis** - Surrounding text can provide clues
3. **Don't rely solely on textRun content** - It's designed to be a fallback
4. **Consider the document structure** - Smart chips are part of larger semantic elements
5. **Use appropriate APIs** - For calendar events, use Calendar API; for contacts, use People API

## Important Notes

- This is **not a bug** - it's intentional API design
- The client library correctly returns what the API provides
- Focus on extracting semantic meaning rather than display text
- Different smart chip types may require different extraction strategies

## References

- [GitHub Issue #2547](https://github.com/googleapis/google-api-python-client/issues/2547)
- [Google Docs API Documentation](https://developers.google.com/docs/api)
- [Unicode Private Use Areas](https://en.wikipedia.org/wiki/Private_Use_Areas)
44 changes: 44 additions & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Google API Python Client Examples

This directory contains example scripts demonstrating how to work with various Google APIs using the `google-api-python-client` library.

## Available Examples

### `docs_unicode_dates_example.py`

**Addresses**: [GitHub Issue #2547](https://github.com/googleapis/google-api-python-client/issues/2547) - "Google doc dates returned as unicode (e.g., \\ue907)"

**Purpose**: Demonstrates how to properly handle Unicode placeholder characters returned by the Google Docs API for smart chips (dates, people, places, etc.).

**Key Concepts**:
- Understanding that `\ue907` is intentional API behavior, not a bug
- Extracting actual date/chip information from `richLink` properties
- Working with Google Docs API document structure
- Best practices for handling non-text elements

**Usage**:
```bash
# Run the demonstration (no authentication required)
python examples/docs_unicode_dates_example.py

# To analyze an actual document, modify the script with your document ID
# and uncomment the analysis function call
```

**Requirements**:
- `google-auth`
- `google-api-python-client`
- Valid Google Docs API credentials (for real document analysis)

## Contributing Examples

When contributing new examples:
1. Focus on common use cases or frequently asked questions
2. Include comprehensive comments and documentation
3. Provide both demonstration code and real-world usage examples
4. Follow the existing code style and structure
5. Test thoroughly before submitting

## Authentication

Most examples require Google API credentials. See the [authentication documentation](https://cloud.google.com/docs/authentication) for setup instructions.
Loading
Loading