@arthurpassos (Collaborator) commented Nov 5, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Export partition support for replicated MergeTree engines

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions bot commented Nov 5, 2025

Workflow [PR], commit [cf13ec2]

@arthurpassos changed the title from "[DRAFT] Yet another export replicated partition pr" to "Yet another export replicated partition pr" on Nov 5, 2025
@arthurpassos force-pushed the export_replicated_mt_partition_v2 branch from c768bca to fb2d7f7 on November 5, 2025 11:52
@arthurpassos (Collaborator, Author) commented Nov 6, 2025

Discussion from today:

Let's talk about the design:

Disclaimer: it does not need to be the most optimized one. Premature optimization killed us in the past; it'll kill us again.

<replica_zk_path>/exports <-- new path to store exports
    <export_key> <-- partition_id + destination_id, does not allow duplicates
        status <-- COMPLETED
        metadata.json <-- immutable stuff: t_id, p_id, destination, src_replica, parts_count, part_names, create_time, max_retries, ttl_seconds (Remember to add parallelism control here)
        processing <-- znode that holds pending, in_progress or failed parts.
            <part_1>
                status <-- PENDING or FAILED
                retry_count
                finished_by <-- only exists for failed parts
            ...
            <part_n>
        processed <-- znode that holds parts that have been successfully completed
            <part_1>
                path <-- relative path in destination storage
                status <-- I guess I don't need this
                finished_by
        locks
            part_1 <-- replica1, ephemeral
            ...
            part_n <-- replica_n, ephemeral
        exceptions_per_replica
            <replica1> <-- created upon demand
                last_exception
                    part
                    exception <-- message
            count <-- problematic; maybe it can be solved with a simple lock, since it is exclusive to this replica
            ...
            <replica_n>
<replica_zk_path>/exports_cleanup_lock <-- ephemeral
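A minimal Python sketch of the znode layout above, modeled as a flat dict of path -> value standing in for real ZooKeeper nodes. Everything here (the function name, the `partition_id + "_" + destination_id` key format, the initial `PENDING` status) is illustrative only; the actual PR is C++ inside ClickHouse.

```python
import json
import time

def build_export_nodes(replica_zk_path, partition_id, destination_id,
                       src_replica, part_names, max_retries=3, ttl_seconds=3600):
    """Model the export znode tree as {path: value}. Illustrative only."""
    export_key = f"{partition_id}_{destination_id}"  # no duplicates allowed
    base = f"{replica_zk_path}/exports/{export_key}"
    metadata = {  # the immutable stuff from metadata.json above
        "p_id": partition_id,
        "destination": destination_id,
        "src_replica": src_replica,
        "parts_count": len(part_names),
        "part_names": part_names,
        "create_time": int(time.time()),
        "max_retries": max_retries,
        "ttl_seconds": ttl_seconds,
    }
    nodes = {
        f"{base}/status": "PENDING",  # becomes COMPLETED at commit time
        f"{base}/metadata.json": json.dumps(metadata),
        f"{base}/processed": "",
        f"{base}/locks": "",
        f"{base}/exceptions_per_replica": "",
    }
    for part in part_names:  # every part starts out pending under processing/
        nodes[f"{base}/processing/{part}/status"] = "PENDING"
        nodes[f"{base}/processing/{part}/retry_count"] = "0"
    return nodes
```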


Points to discuss:

  • No need for CAS loops (actually we do need one for exceptions; I'll have to rethink this)
  • The commit phase (moving the last part to processed + checking whether we should commit)
  • Exporting while no longer holding a lock
  • Removing recursive operations (gather code evidence) <-- cover later
  • system.replicated_partition_exports being expensive
  • Reducing communication with ZooKeeper by keeping manifests local
  • Ordering based on create time

Later:

  • Discuss the kill operation
  • Consider bumping retry_count at pick-up time

Algorithm:

export_request:

sanity_check() // do we have destination table, partition id, task entry does not exist (unless force or ttl expired)
create_zk_export_structure() <-- transactional
/// note it does not trigger any export operation at this point, just puts it in zk
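The duplicate-entry part of that sanity check could be sketched like this; a hypothetical predicate, where `existing` stands for the entry read from ZooKeeper, and the force/TTL-expiry escape hatches follow the description above.

```python
import time

def may_create_export(existing, force=False, now=None):
    """Return True if a new export entry may be created.

    existing: None, or a dict with 'create_time' and 'ttl_seconds'
    (mirroring metadata.json). Illustrative sketch, not the PR's API.
    """
    if existing is None:
        return True          # no task entry exists: ok to create
    if force:
        return True          # caller explicitly overrides the duplicate check
    now = now if now is not None else time.time()
    # allow re-creation only once the previous entry's TTL has expired
    return now >= existing["create_time"] + existing["ttl_seconds"]
```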

manifest_list_updating_task:

Load new entries
If we hold the cleanup lock, also remove stale entries from ZooKeeper and local state
Upload dangling commit files if any
trigger scheduling task
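The four steps of that task could be sketched as a single cycle over a context object; all helper names (`load_new_entries`, `has_cleanup_lock`, etc.) are illustrative placeholders, not the PR's actual methods.

```python
def manifest_update_cycle(ctx):
    """One iteration of the manifest-list-updating background task (sketch)."""
    ctx.load_new_entries()              # pull new export entries from ZooKeeper
    if ctx.has_cleanup_lock():          # only the lock holder prunes
        ctx.remove_stale_entries()      # drop expired entries, ZK + local
    ctx.upload_dangling_commit_files()  # finish any interrupted commits
    ctx.trigger_scheduling_task()       # kick the scheduler
```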


scheduling_task:
loop over local manifests
    sanity_check // do we have destination table, is it still pending (zk calls)
    get_list_of_pending_parts (processing directory)
    get_list_of_locks

    loop over pending parts
        skip if we don't have the part
        skip if it is already locked (lock list is already local)

        try_to_lock
        try_to_schedule <-- we might need to optimize here
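The inner loop above can be sketched as follows, with the lock table modeled as a plain dict. In reality `try_to_lock` is an ephemeral znode create that can fail if another replica wins the race; here the dict insert always succeeds, which is the simplification this sketch makes.

```python
def schedule_parts(pending_parts, local_parts, locks, replica):
    """Pick the pending parts this replica can export (illustrative sketch).

    locks is mutated in place: {part_name: replica_name}, standing in for
    the ephemeral lock znodes.
    """
    scheduled = []
    for part in pending_parts:
        if part not in local_parts:
            continue  # skip: we don't have the part
        if part in locks:
            continue  # skip: already locked (lock list is already local)
        locks[part] = replica     # try_to_lock (ephemeral znode in reality)
        scheduled.append(part)    # try_to_schedule would enqueue the export
    return scheduled
```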

part_export_success_callback:
    check if we still own the lock, and grab the field version
    move it to processed with appropriate fields

    /// below is not transactional
    check if processing is empty (then we need to commit)
    ship commit file to s3
    mark entire task as completed
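A sketch of that callback over an in-memory `state` dict mirroring the znode tree (`locks`, `processing`, `processed`, `status`); the names are illustrative. As noted above, everything after the move to `processed` is not transactional in the real design.

```python
def on_part_success(state, part, replica, dest_path):
    """Handle a successfully exported part (illustrative sketch)."""
    if state["locks"].get(part) != replica:
        return False  # we no longer own the lock; someone else took over
    # move the part from processing to processed with its destination path
    state["processing"].pop(part)
    state["processed"][part] = {"path": dest_path, "finished_by": replica}
    # commit phase: if processing is now empty, this was the last part
    if not state["processing"]:
        state["commit_file_shipped"] = True  # ship commit file to s3
        state["status"] = "COMPLETED"        # mark entire task as completed
    return True
```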

part_export_fail_callback:
    check if we still own the lock, grab the version
    grab retry_count from zk
    if retry_count + 1 >= max_retries
        set part under processing as failed
        fail the entire task
    increase retry_count
    populate exceptions_per_replica
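The retry bookkeeping of the failure callback, sketched against the same in-memory `state` model (illustrative names throughout, and without the ZK version checks the real code needs):

```python
def on_part_failure(state, part, replica, exc_message, max_retries):
    """Handle a failed part export (illustrative sketch)."""
    if state["locks"].get(part) != replica:
        return  # we no longer own the lock; do nothing
    entry = state["processing"][part]
    entry["retry_count"] += 1
    if entry["retry_count"] >= max_retries:
        entry["status"] = "FAILED"   # part stays under processing as failed
        state["status"] = "FAILED"   # fail the entire task
    # populate exceptions_per_replica for this replica
    ex = state["exceptions_per_replica"].setdefault(
        replica, {"count": 0, "last_exception": None})
    ex["count"] += 1
    ex["last_exception"] = {"part": part, "exception": exc_message}
```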

Still to cover:

  • cleanup explanation
  • recursive lookups
  • kill
  • system.replicated_partition_exports
