DOC-13760 produce markdown per page (WIP) #863
base: master
Pull request overview
This PR implements an Antora extension to generate Markdown files from HTML output for LLM consumption. The extension creates .md files alongside HTML pages and adds alternate link metadata for each page.
Key Changes:
- New Antora extension that converts HTML to Markdown using JSDOM and semantic markdown conversion
- Configuration updates across preview and staging playbooks to enable the extension
- Addition of new npm dependencies for HTML-to-Markdown conversion
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| package.json | Adds dependencies for DOM parsing and HTML-to-Markdown conversion |
| lib/markdown-for-llm.js | Implements the core extension logic for converting pages to Markdown |
| home/preview/DOC-13760-produce-markdown-per-page.yml | Configures UI bundle override for preview environment |
| antora-playbook.preview.yml | Registers the markdown extension and adds analytics branch |
| antora-playbook-staging-chatbot.yml | Registers extension, updates UI bundle, and modifies SDK branches |
| antora-playbook-staging-chatbot.diff.yml | Adds extension registration and uncomments UI bundle configuration |
    function overrideElementProcessing (element) {

      if (element.tagName?.toLowerCase() === 'a'
        && element.className === 'anchor' ) {
Copilot
AI
Nov 27, 2025
The condition uses className (string) for comparison, but the admonition check below uses classList.contains(). Use classList.contains('anchor') for consistency and to handle multiple classes correctly.
Suggested change:

    -    && element.className === 'anchor' ) {
    +    && element.classList?.contains('anchor') ) {
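To illustrate why the review prefers a per-class check, here is a minimal sketch that runs without a DOM. The `hasClass` helper is hypothetical and only approximates what `classList.contains()` does by splitting the class attribute on whitespace; the class string `'anchor external'` is an invented example, not taken from the generated HTML.

```javascript
// Hypothetical stand-in for classList.contains(), so the sketch needs no DOM.
function hasClass (className, name) {
  return className.split(/\s+/).filter(Boolean).includes(name)
}

// Invented example: an anchor element that carries a second class.
const className = 'anchor external'

console.log(className === 'anchor')        // strict equality misses it
console.log(hasClass(className, 'anchor')) // per-class check finds it
```

If every anchor in the generated HTML carries exactly one class, both checks behave the same; the per-class form only matters once a second class appears.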
    element.classList.remove('admonitionblock')
    const admonition = element.className.toUpperCase()
Copilot
AI
Nov 27, 2025
After removing 'admonitionblock' from classList, element.className may contain other classes or be empty. This will produce incorrect admonition types. Extract the admonition class name before removing 'admonitionblock' or use a more specific selector to identify the admonition type.
Suggested change:

    -    element.classList.remove('admonitionblock')
    -    const admonition = element.className.toUpperCase()
    +    // Extract admonition type before removing 'admonitionblock'
    +    const admonitionType = Array.from(element.classList).find(cls => cls !== 'admonitionblock') || '';
    +    element.classList.remove('admonitionblock')
    +    const admonition = admonitionType.toUpperCase()
    const path = page.out.path.replace(/\.html$/, '.md')

    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${path}`

    siteCatalog.addFile({
      contents: Buffer.from(markdown),
      out: { path }
Copilot
AI
Nov 27, 2025
[nitpick] The variable name path shadows the outer scope if Node's path module is imported elsewhere. Consider renaming to mdPath or markdownPath to avoid potential confusion.
Suggested change:

    -    const path = page.out.path.replace(/\.html$/, '.md')
    -    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
    -    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${path}`
    -    siteCatalog.addFile({
    -      contents: Buffer.from(markdown),
    -      out: { path }
    +    const mdPath = page.out.path.replace(/\.html$/, '.md')
    +    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
    +    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${mdPath}`
    +    siteCatalog.addFile({
    +      contents: Buffer.from(markdown),
    +      out: { path: mdPath }
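The path handling in this snippet can be exercised in isolation. The sketch below is not the PR's code: `markdownAlternate` is a hypothetical helper, and the `out` argument only mimics the two fields the PR reads (`path` and `rootPath`). It also adds a guard, as an assumption about what might be wanted, for output paths that do not end in `.html`, since the `replace()` call would otherwise leave such a path unchanged.

```javascript
// Hypothetical helper; `out` mimics the shape of page.out as used in the PR.
function markdownAlternate (out) {
  // Guard (an assumption, not in the PR): skip pages that are not *.html,
  // otherwise replace() would return the original path untouched.
  if (!/\.html$/.test(out.path)) return null
  const mdPath = out.path.replace(/\.html$/, '.md')
  // The href is relative to the page, via rootPath, as in the PR.
  return { mdPath, href: `${out.rootPath}/${mdPath}` }
}

// Invented example values for illustration only.
console.log(markdownAlternate({ path: 'server/current/intro.html', rootPath: '../..' }))
console.log(markdownAlternate({ path: 'server/current/data.json', rootPath: '../..' }))
```

Returning `null` lets the caller skip both the `page-markdown-alt` attribute and the `addFile()` call for that page.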
lib/markdown-for-llm.js
Outdated
    module.exports.register = function ({ playbook, config }) {
      const logger = this.getLogger('markdown-for-llm')

      this.on('navigationBuilt', async ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {
Copilot
AI
Nov 27, 2025
The event handler is marked async but contains no await operations. Remove the async keyword as it's unnecessary and may create unneeded promise overhead when processing all pages.
Suggested change:

    -    this.on('navigationBuilt', async ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {
    +    this.on('navigationBuilt', ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {
The Copilot review suggests that the unnecessary async keyword may introduce unneeded overhead. As we're getting Node memory errors, we may as well try removing it!
This PR uses Antora's extension mechanism to generate .md files at the time of publishing the site.
We publish them as *.md in the same directory, and link to the file with a <link rel="alternate" ...> for each page.
There are a few issues to resolve.
FATAL: memory usage is too high and the Node runtime exits with
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
I tried to account for this with NODE_OPTIONS: --max_old_space_size=16384 (doubled from the previous 8192, itself already increased from the default 4096), with no impact. As this creates a file for every page and holds it in memory, this was always a possibility.
MITIGATION 1 (TODO): increase the heap size again
MITIGATION 2 (TODO): try only producing the .md file for the latest versions of components?
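A minimal sketch of what MITIGATION 2 could look like, written against plain objects so it runs without Antora. `selectLatestPages` is a hypothetical helper. It assumes each component exposes `name` and `latest.version`, and each page exposes `src.component` and `src.version`; whether the real catalog objects match that shape should be checked against the Antora documentation before use. The component names and versions are invented.

```javascript
// Hypothetical helper: keep only pages belonging to each component's latest version.
// Assumed shapes: component = { name, latest: { version } },
//                 page = { src: { component, version } }.
function selectLatestPages (pages, components) {
  const latestByComponent = new Map(
    components.map((component) => [component.name, component.latest.version])
  )
  return pages.filter(
    (page) => latestByComponent.get(page.src.component) === page.src.version
  )
}

// Invented sample data for illustration only.
const components = [
  { name: 'server', latest: { version: '8.0' } },
  { name: 'sdk', latest: { version: '3.2' } }
]
const pages = [
  { src: { component: 'server', version: '8.0', relative: 'intro.adoc' } },
  { src: { component: 'server', version: '7.6', relative: 'intro.adoc' } },
  { src: { component: 'sdk', version: '3.2', relative: 'start.adoc' } }
]

console.log(selectLatestPages(pages, components).length)
```

Filtering before conversion would reduce the number of Markdown buffers held in memory at once, at the cost of older versions having no .md alternate.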
MARKUP production: as we are going from AsciiDoc -> HTML -> Markdown, the conversion is a little lossy. This POC inspects the HTML to try to rebuild admonitions (as GitHub Flavored Markdown alerts).
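The admonition rebuild described above can be sketched as a pure function, independent of JSDOM. This is an illustration rather than the extension's code: `admonitionToAlert` and its string inputs are hypothetical. It relies on the five GitHub alert types (NOTE, TIP, IMPORTANT, WARNING, CAUTION) sharing their names with the AsciiDoc admonition types, and falls back to a plain blockquote when no known type is present in the class list.

```javascript
// Sketch only: function name and inputs are hypothetical, not from the PR.
const ALERT_TYPES = new Set(['NOTE', 'TIP', 'IMPORTANT', 'WARNING', 'CAUTION'])

// classNames: the element's class attribute, e.g. 'admonitionblock note'
// text: the plain-text content of the admonition body
function admonitionToAlert (classNames, text) {
  // Pick the first class that names a known alert type.
  const type = classNames
    .split(/\s+/)
    .map((cls) => cls.toUpperCase())
    .find((cls) => ALERT_TYPES.has(cls))
  // Prefix every body line with the blockquote marker.
  const lines = text.trim().split('\n').map((line) => `> ${line}`.trimEnd())
  // Fall back to a plain blockquote when no known type is present.
  return type ? [`> [!${type}]`, ...lines].join('\n') : lines.join('\n')
}

console.log(admonitionToAlert('admonitionblock note', 'Back up first.'))
```

Matching against a fixed set of types, rather than taking whatever class remains, avoids emitting an alert marker that GitHub would not recognise.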
We could try using OpenDevise's https://github.com/opendevise/downdoc but the Antora pipeline doesn't currently seem to expose the collated AsciiDoc markup. For example, looking at the Generator Events, it seems that at the contentClassified stage we have the AsciiDoc source (but without includes processed), and at the following documentsConverted stage we get the output HTML, but there's no intermediate step.
We could potentially use Antora Assembler, but that works on a whole site nav, whereas we're looking at producing Markdown for every single file.