feat: Add converter for legacy .doc files via unword #2058

Open: dridk wants to merge 1 commit into microsoft:main from dridk:feat/doc-converter-unword

dridk commented Jun 2, 2026

Summary

Adds support for converting legacy Microsoft Word 97-2003 .doc files (OLE/CFB binary format) to Markdown, addressing the long-standing request in #23.

The conversion uses unword, a small, Rust-backed Python package with no external binary dependencies (no LibreOffice, COM, or antiword). unword is published on PyPI with the Rust libraries precompiled for Windows, macOS, and Linux.

Doc parsing is handled entirely by unword (built with Claude Code using the legacy Microsoft Word specification published by Microsoft), so this PR stays very small.
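
For reviewers who want to try the branch, here is a minimal usage sketch. It assumes the branch is installed with the optional extra named in this PR (for example `pip install 'markitdown[doc]'`) and that keyword arguments passed to `MarkItDown.convert` are forwarded to the converter. The helper name `convert_legacy_doc` is hypothetical and only exists for illustration.

```python
from typing import Any


def convert_legacy_doc(path: str, **converter_kwargs: Any) -> str:
    """Convert a legacy .doc file to Markdown via MarkItDown (sketch)."""
    # Imported lazily so this module still loads when markitdown is absent.
    from markitdown import MarkItDown

    converter = MarkItDown()
    # Extra keyword arguments are passed through to the converters; with this
    # PR that is assumed to include keep_textboxes (see the PR description).
    result = converter.convert(path, **converter_kwargs)
    return result.text_content


if __name__ == "__main__":
    import sys

    # Usage: python convert_doc.py path/to/file.doc
    print(convert_legacy_doc(sys.argv[1]))
```

Calling `convert_legacy_doc("report.doc", keep_textboxes=False)` would, under the same assumptions, skip the appended textbox contents.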

Adds a DocConverter that converts legacy Microsoft Word 97-2003 (.doc,
OLE/CFB binary) files to Markdown using the `unword` package, addressing
the long-standing request in microsoft#23.

- New DocConverter (extension `.doc` / mimetype `application/msword`),
  registered after DocxConverter and gated behind an optional `[doc]`
  extra (`unword>=0.2.2`).
- Preserves heading levels; optional `keep_textboxes` kwarg (default
  True) appends extracted textbox contents.
- Adds a test fixture and test vector.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
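
To make the behaviour in the commit message easier to review, here is a rough, dependency-free sketch of the three ideas: accepting by extension or mimetype, gating on the optional package, and appending textboxes. It is not the code in this PR. `StreamInfoSketch`, `accepts`, `dependency_available`, and `convert_doc` are hypothetical names, and the two `extract_*` callables stand in for whatever unword actually exposes, since its API is not shown here.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable, List, Optional

ACCEPTED_EXTENSIONS = (".doc",)
ACCEPTED_MIME_PREFIXES = ("application/msword",)


@dataclass
class StreamInfoSketch:
    """Hypothetical stand-in for the stream metadata a converter inspects."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


def accepts(info: StreamInfoSketch) -> bool:
    """Accept a stream if its extension or mimetype marks it as legacy .doc."""
    extension = (info.extension or "").lower()
    mimetype = (info.mimetype or "").lower()
    if extension in ACCEPTED_EXTENSIONS:
        return True
    return any(mimetype.startswith(p) for p in ACCEPTED_MIME_PREFIXES)


def dependency_available(module_name: str = "unword") -> bool:
    """Report whether the optional package can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def convert_doc(
    data: bytes,
    extract_body: Callable[[bytes], str],            # hypothetical parser hook
    extract_textboxes: Callable[[bytes], List[str]],  # hypothetical parser hook
    keep_textboxes: bool = True,
) -> str:
    """Build Markdown from the body, optionally appending textbox contents."""
    markdown = extract_body(data).strip()
    if keep_textboxes:
        boxes = [b.strip() for b in extract_textboxes(data) if b.strip()]
        if boxes:
            markdown = "\n\n".join([markdown, *boxes])
    return markdown
```

Checking `dependency_available()` before registering the converter is one way an optional extra can stay optional: users without `unword` installed would get a clear "missing dependency" path instead of an import error at startup.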

dridk commented Jun 2, 2026

@microsoft-github-policy-service agree company="CHU BREST"

angelinashepherd commented

Would desperately love this to be merged in! Such a big unlock.

angelinashepherd commented Jun 3, 2026

Hi @afourney @yungshinlintw could you please review? 🙏
