Skip to content

fix: filter sitemap-derived URLs by enqueue strategy#3797

Open
vdusek wants to merge 4 commits into
masterfrom
fix/sitemap-enqueue-strategy
Open

fix: filter sitemap-derived URLs by enqueue strategy#3797
vdusek wants to merge 4 commits into
masterfrom
fix/sitemap-enqueue-strategy

Conversation

@vdusek

@vdusek vdusek commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

SitemapRequestList, Sitemap.load / parseSitemap, and RobotsTxtFile.getSitemaps() now apply an enqueue strategy to the URLs they extract. They keep only entries that match the strategy (default same-hostname) relative to their parent sitemap, and always drop non-http(s) schemes. This brings sitemap loading in line with enqueueLinks, which already scopes discovered links to the same hostname by default.

The selected strategy is also stamped onto the emitted Request objects, so it keeps being enforced after navigation (e.g. across redirects).

This changes the default behavior: cross-host sitemap entries are no longer enqueued unless you opt in with enqueueStrategy: 'all' (or 'same-domain' / 'same-origin').

@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. labels Jun 30, 2026
@vdusek vdusek self-assigned this Jun 30, 2026
Comment thread packages/core/src/storages/sitemap_request_list.ts Outdated
export type SearchParams = string | URLSearchParams | Record<string, string | number | boolean | null | undefined>;

/** Enqueue strategy values, mirroring the `EnqueueStrategy` enum from `@crawlee/core` (which `@crawlee/utils` can't import). */
export type EnqueueStrategyValue = 'all' | 'same-hostname' | 'same-domain' | 'same-origin';

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Suggested change
export type EnqueueStrategyValue = 'all' | 'same-hostname' | 'same-domain' | 'same-origin';
export type EnqueueStrategyValue = `${EnqueueStrategy}`;

i would rather inline this instead of introducing a new exported type for it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible because crawlee/core already depends on crawlee/utils, so that would be a circular reference. Am I correct? 🙂

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type-level cycles are fine

Comment thread packages/utils/package.json
@vdusek

vdusek commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

@B4nan FYI; this is still a draft - I haven't reviewed most of it yet

@B4nan

B4nan commented Jun 30, 2026

Copy link
Copy Markdown
Member

no worries, i was just curious 🙃

@vdusek vdusek requested a review from B4nan July 1, 2026 10:50
@vdusek vdusek marked this pull request as ready for review July 1, 2026 10:50
@vdusek vdusek requested review from janbuchar and removed request for B4nan July 1, 2026 10:54
@vdusek

vdusek commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Requesting a review from @janbuchar, since you may have additional context from the Python solution.

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants