@SuaYoo SuaYoo commented Oct 28, 2025

Resolves #2863

Changes

Manual testing

Screenshots

Page Image/video

Fixes #2867 

The backend implementation involves:

Operator:
- A new CollIndex CRD type; btrix-crds is updated to 0.2.0.
- The operator manages the new CRD type, creating a new Redis instance when the index should exist (uses the redis_dedupe_memory and redis_dedupe_storage chart values).
- dedupe_importer_channel can configure the crawler channel used for index imports.
- The operator starts the crawler in 'indexer' mode.

Workflows & Crawls:
- Workflows have a new `dedupeCollId` field for dedupe while crawling. The `dedupeCollId` collection must also be one that the crawl is auto-added to.
- There is a new waiting state, `waiting_for_dedupe_index`, which is entered if a crawl is starting but the index is not yet ready.
- Each crawl has bi-directional links: `requiresCrawls` lists the crawls it requires for dedupe, and `requiredByCrawls` lists the other crawls that require this crawl.
- `autoAddCollections` is automatically updated to always include the `dedupeCollId` collection.
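
For reviewers, the workflow and crawl rules above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the in-memory `crawls` dict, and the `"starting"` placeholder state are assumptions. Only `dedupeCollId`, `autoAddCollections`, `requiresCrawls`, `requiredByCrawls`, and `waiting_for_dedupe_index` come from the description.

```python
from typing import Dict, List, Optional, Set


def with_dedupe_collection(
    auto_add: List[str], dedupe_coll_id: Optional[str]
) -> List[str]:
    """Return autoAddCollections with dedupeCollId included when it is set."""
    if dedupe_coll_id and dedupe_coll_id not in auto_add:
        return [*auto_add, dedupe_coll_id]
    return list(auto_add)


def initial_state(dedupe_coll_id: Optional[str], index_ready: bool) -> str:
    """Pick a start state; "starting" is a placeholder for the normal path."""
    if dedupe_coll_id and not index_ready:
        return "waiting_for_dedupe_index"
    return "starting"


def link_required_crawls(
    crawls: Dict[str, Dict[str, Set[str]]],
    crawl_id: str,
    required_ids: Set[str],
) -> None:
    """Record both directions of a dedupe dependency (hypothetical store)."""
    crawls[crawl_id]["requiresCrawls"] |= required_ids
    for required_id in required_ids:
        crawls[required_id]["requiredByCrawls"].add(crawl_id)
```

Keeping both link directions in sync presumably makes it possible to tell, from either crawl, whether another crawl depends on it, without scanning every crawl.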

Collection:
- Collections have a new `hasDedupeIndex` field.
- Adding items to or removing items from a collection marks the CollIndex object for update by updating the `collItemsUpdatedAt` timestamp, which triggers a reindex.
- The CollIndex object is deleted when the collection is deleted.
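
The timestamp-triggered reindex could work along these lines. This is a minimal sketch under assumptions: the description only says that updating `collItemsUpdatedAt` triggers a reindex, so the `last_indexed_at` value and the comparison itself are hypothetical and may not match how the operator actually decides.

```python
from datetime import datetime
from typing import Optional


def needs_reindex(
    coll_items_updated_at: Optional[datetime],
    last_indexed_at: Optional[datetime],
) -> bool:
    """Return True when items changed after the last (hypothetical) index run."""
    if coll_items_updated_at is None:
        # No items were ever added or removed, so there is nothing to index.
        return False
    if last_indexed_at is None:
        # Items changed but the index has never been built.
        return True
    return coll_items_updated_at > last_indexed_at
```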

Indexing depends on a version of the crawler that supports indexing mode, from webrecorder/browsertrix-crawler#884.

---------
Co-authored-by: Tessa Walsh <[email protected]>
@SuaYoo SuaYoo force-pushed the feature-dedupe--frontend-org branch from 1c2faf1 to 2203bb9 Compare November 3, 2025 17:02
SuaYoo and others added 6 commits November 3, 2025 09:12
- Adds new "Deduplication" section to workflows
- Allows users to use a collection for deduplication
- Various refactors for consistency
@SuaYoo SuaYoo force-pushed the feature-dedupe--frontend-org branch from 2203bb9 to 8bbcc41 Compare November 3, 2025 17:13