Conversation
Pull Request Overview
This PR significantly enhances the SEO module by adding comprehensive JSON-LD structured data support, improved robots.txt generation with AI crawler controls, and llms.txt generation for AI/LLM interactions.
Key Changes
- Added 18 different JSON-LD schema types (Product, Event, JobPosting, Recipe, HowTo, etc.) with detailed field configurations
- Enhanced robots.txt with selective AI crawler controls (GPTBot, ClaudeBot, etc.)
- Implemented llms.txt generation for AI training policy management
- Added global organization settings and social profiles
- Expanded internationalization support across multiple languages
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `modules/@apostrophecms/seo-fields-global/index.js` | Added global SEO settings including AI crawler controls, organization info, social profiles, and llms.txt configuration |
| `modules/@apostrophecms/seo-fields-doc-type/index.js` | Added 18 schema types with conditional field groups for structured data on pages/pieces |
| `lib/jsonld-schemas.js` | New handler class for generating JSON-LD structured data with fallback logic and validation |
| `lib/nodes.js` | Enhanced head generation with theme colors, pagination, JSON-LD injection, and hreflang support |
| `lib/utils.js` | New utility for extracting image data from relationships |
| `index.js` | Added methods for custom schema registration |
| `i18n/*.json` | Comprehensive translations for all new fields across 6 languages |
| `CHANGELOG.md` | Documented new structured data feature |
boutell
left a comment
Excited for this.
I do have notes, see above.
Also: this is a lot of code running on every page. Where are the unit tests?
README.md
Outdated
|
|
||
| ### ⚙️ Requires Content Structure Setup | ||
|
|
||
| Advanced structured data types need specific fields in your content types: |
Links needed to what these things are. It took me a few eyeblinks.

I added a link to the introductory sentence. Is that enough?
README.md (Outdated)

> Traditional search engines (Googlebot, Bingbot) are always allowed unless using "Block All" mode.
>
> **Technical Notes:**
> - A physical `robots.txt` file in your `public/` directory will override these settings
Or `sites/public` and `dashboard/public`, for multisite.
README.md (Outdated)

> **Available Modes:**
>
> 1. **Allow All (Search + AI)** - Default open access for all crawlers
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
"by AI agents that choose to respect this standard" (many don't).
README.md (Outdated)

> 1. **Allow All (Search + AI)** - Default open access for all crawlers
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
> 3. **Selective AI Crawlers** - Granular control over individual AI crawlers
> 4. **Block All** - Prevents all indexing
> 5. **Custom** - Write your own robots.txt content
Warning emoji: the only thing worse than leaving your pre-launch site open to Google is forgetting to open up your launched, public site to Google. Make sure you follow up at launch time.
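For concreteness, the "Allow Search, Block AI Training" mode would presumably emit robots.txt content along these lines. This is a sketch only: GPTBot and ClaudeBot are crawlers the PR mentions, and the actual generated output may differ.

```
# Illustrative output only - not the module's actual generated content
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
```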
README.md (Outdated)

> The module automatically generates `<link rel="preload">` tags for each configured font. The `crossorigin` attribute is automatically added for absolute URLs (CDN/external fonts) and omitted for relative URLs (self-hosted fonts).
So this is an alternate way to load fonts? Should you remove other frontend CSS or HTML you may have written to load these fonts, or is it complementary?

Updated to clarify these points - and it is complementary.
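As a sketch of the behavior the README describes (the helper name and exact tag attributes are assumptions, not the module's actual code):

```javascript
// Sketch of the documented preload-tag rule: crossorigin is included for
// absolute URLs (CDN/external fonts) and omitted for relative URLs
// (self-hosted fonts). Function name is illustrative.
function fontPreloadTag(url) {
  const isAbsolute = /^https?:\/\//i.test(url);
  return `<link rel="preload" href="${url}" as="font" type="font/woff2"` +
    (isAbsolute ? ' crossorigin>' : '>');
}
```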
README.md (Outdated)

> **Where to store fonts:**
>
> 1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Why is self-hosted recommended? We sell hosting.

I would call this "simple single-server deployments".

Changed - not sure if we want to "sell" hosting here with some wording changes?
README.md (Outdated)

> - Requires CORS: `Access-Control-Allow-Origin: *`
> - Example: `{ url: 'https://cdn.yoursite.com/fonts/inter.woff2' }`
>
> 3. **Don't use with Google Fonts**: They have their own optimization and don't benefit from preload
Not even when you self-serve them? Google stopped recommending use of their CDN a long time ago, right?

I didn't realize they discouraged using their CDN. Thanks! Changed.
> **Best practices:**
> - Only preload fonts used above the fold (typically 1-2 fonts maximum)
> - Use `woff2` format for best compression (supported by all modern browsers)
lib/jsonld-schemas.js (Outdated)

> }
> }
>
> if (process.env.APOS_SEO_DEBUG && date && fallbackSource && !fallbackSource.startsWith('schema.')) {
Recommend writing a "log" function for this file that does this process.env test so it's not repeated with potential typos.

Tests added
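The suggestion amounts to something like the following sketch (helper and message names are assumptions, not the PR's actual code):

```javascript
// Single debug helper for lib/jsonld-schemas.js so the
// process.env.APOS_SEO_DEBUG check is written exactly once.
function seoDebug(...args) {
  if (process.env.APOS_SEO_DEBUG) {
    console.warn('[apos-seo]', ...args);
  }
}

// Call sites then test only their domain condition, e.g.:
// if (date && fallbackSource && !fallbackSource.startsWith('schema.')) {
//   seoDebug('date fallback source:', fallbackSource);
// }
```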
boutell
left a comment
This continues to be very cool
In addition to my notes above...
Unit tests are cool but please add some traditional functional tests that stand up actual pieces and pages and verify it works as expected when run end to end.
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training by AI agents that choose to respect this standard
> 3. **Selective AI Crawlers** - Granular control over individual AI crawlers that support this standard
You provided specific choices in a dropdown. Has there been testing by us or anyone else to verify that the bots in question respect the setting? If they don't, it feels like a mistake to say "here's how you block [X]" and have it just plain not work.

See comment above. According to official docs, they respect robots.txt; I'm not sure how I would test their real-world behavior. I modified the README to hedge a little more.
README.md (Outdated)

> 5. **Custom** - Write your own robots.txt content
>
> **Selective Mode Crawlers:**
> For fine-grained control, use Selective mode to choose specific AI crawlers:
When I wrote this comment, I thought the presence of these on the list implied they really can be blocked. If that's not known, they should not be listed until it is demonstrated.
> | **AI Support** | Most AI crawlers respect robots.txt | **Most AI systems do not read llms.txt** |
> | **Example** | "Block GPTBot from accessing /api/*" | "Content may be used for search, not training" |
>
> **⚠️ Important:** While `llms.txt` represents a forward-thinking SEO strategy, it should not be relied upon for actual crawler control. Use `robots.txt` for enforceable policies. The `llms.txt` file serves as a policy statement and may gain broader adoption over time.
Even robots.txt may be ignored by some AI crawlers, especially those that falsely identify themselves as web browsers.
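For reference, the llms.txt format (per the llmstxt.org proposal) is plain markdown. Purely as an illustration of what such a generated policy file can look like, not this module's actual output:

```
# Example Site

> A test site demonstrating llms.txt

## Policy

- Content may be used for search indexing.
- Content may not be used for AI training.
```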
> - **robots.txt** provides technical enforcement (works now)
> - **llms.txt** clearly communicates your policies (may work in the future)

This is a very helpful sentence.
> **Recommended approach:** Use both together:
> - **robots.txt** provides technical enforcement (works now)

Respected now by many crawlers; some bad actors ignore it.
README.md (Outdated)

> For **Article** and **Recipe** schemas, you can provide author information in multiple formats. Authors can be stored either as a simple string field or as a relationship to `@apostrophecms/user`. The SEO module supports both.
I believe I flagged this as a concern: relationships to users usually won't work, because users are invisible to non-admin users and the public. I would suggest allowing any relationship with the name `_author` and documenting it that way.

Revised:

**Resolution order:**
1. `schema.author` string (if present and non-empty)
2. `document.author` string
3. `_author` relationship: the first joined doc on `document._author`
4. `updatedBy` user on the document: `title`, then `name`
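The documented resolution order can be sketched like this (field names follow the README; the module's actual implementation may differ):

```javascript
// Resolve an author display name using the documented fallback chain:
// schema.author string -> document.author string -> first _author
// relationship doc -> updatedBy user (title, then name).
function resolveAuthorName(schema, doc) {
  if (schema && typeof schema.author === 'string' && schema.author.trim()) {
    return schema.author;
  }
  if (typeof doc.author === 'string' && doc.author.trim()) {
    return doc.author;
  }
  if (Array.isArray(doc._author) && doc._author.length > 0) {
    return doc._author[0].title || null;
  }
  if (doc.updatedBy) {
    return doc.updatedBy.title || doc.updatedBy.name || null;
  }
  return null;
}
```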
README.md (Outdated)

> **Option 3: Automatic fallback:**
> If neither field is provided, the module uses the currently logged-in user's name.
README.md (Outdated)

> /* Your existing CSS - keep this! */
> @font-face {
>   font-family: 'Inter';
>   src: url('/fonts/inter-variable.woff2') format('woff2');
This is going to fail when deploying to the cloud. The styles will be in S3, but the fonts will not be, unless you put them in the `public/` subdir of a module and use a `/modules/...` path in the CSS to access them, per our asset docs.
README.md (Outdated)

> 1. **Simple single-server deployments**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
> // helper: ensure URL is absolute
> ensureAbsoluteUrl(url, baseUrl) {
>   if (!url) {
Why not use the URL constructor with its second argument to save a lot of code?
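The suggestion looks roughly like this (a sketch; the real helper's error handling and return values may differ):

```javascript
// Let the WHATWG URL constructor's base argument do the resolution:
// absolute URLs pass through unchanged, relative URLs resolve against
// baseUrl, and invalid input yields null.
function ensureAbsoluteUrl(url, baseUrl) {
  if (!url) {
    return null;
  }
  try {
    return new URL(url, baseUrl).href;
  } catch (e) {
    return null;
  }
}
```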
> // Verify social profiles
> assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
> assert.strictEqual(orgSchema.sameAs.length, 2);
> assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));

Check failure - Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
> assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
> assert.strictEqual(orgSchema.sameAs.length, 2);
> assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
> assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));

Check failure - Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
> llmsTxt.includes('A test site demonstrating llms.txt'),
> 'Should include site description'
> );
> assert(llmsTxt.includes('https://example.com'), 'Should include base URL');

Check failure - Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
Summary

This PR adds structured data creation and improved robots.txt generation, along with llms.txt generation. The module has also been structured to allow easy addition of new schemas at the project level. The README has been updated to cover the new features.