Incremental compaction #32381
Conversation
A few minor things, but the overall shape looks good! The compaction stream seems very tidy.
The performance / load analysis you have seems reasonable! If you're adding a
Good call, added a flag to turn it off if need be.
Great - thanks for the changes!
Let's run a set of nightlies on this, but otherwise I think you're good to go.
This reverts commit 69e7465.
[persist] Revert "Incremental compaction (#32381)"
This reverts commit 5e72e36.
Towards https://github.com/MaterializeInc/database-issues/issues/9191
Today, we have no good way to split the work of compaction into smaller parts. This is becoming an issue as datasets and clusters continue to grow in size: if a compaction takes a significant amount of time, there is a risk that the process running it won't live long enough to finish (for whatever reason: failure, shutdown, scheduling, etc.).
This PR aims to improve the situation when compacting many shorter runs. We already split the work up into "chunks" based on the size of the runs, but we don't write the results back out into state until all chunks are complete. This is suboptimal: imagine a large compaction is chugging along, 99 of its 100 chunks of work are done, and then the cluster shuts down before the last one can finish. All that work is wasted.
This PR "checkpoints" it's work into state after each chunk is done. That way in the example above, only the partially finished 100th chunk is lost. (Incremental work within chunks will be the subject of future work).
There is a tradeoff here, though: it means writing to state more often, which risks putting CRDB under additional load. We currently execute 650-750 writes per second to each of our CRDB nodes in us-east-1 on average, and in us-east-1 on the order of 200 chunks per second are queued up. In the worst case, where every chunk completes immediately and concurrently, that's up to 200 extra writes per second, roughly a 25-30% increase in QPS on our CRDB cluster (I think the cluster can handle it based on the resource usage I'm seeing, but setting that aside...). I don't expect every chunk across every environment to complete immediately and concurrently, so the likely impact on QPS should be lower than 200/s. That said, we don't have a sense of per-chunk timing, so it's hard to estimate more precisely. An anecdotal test in staging didn't reveal any undue load.
If this remains a concern, some form of backpressure could be implemented to batch applies.
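One possible shape for that backpressure, sketched with the same hypothetical stand-ins as above (this is an assumption about a future direction, not something implemented in this PR): buffer finished chunk outputs and flush several at once in a single state write.

```rust
struct HollowBatch; // compacted output of one chunk (hypothetical stand-in)

async fn apply_batched(_outputs: &[HollowBatch]) {
    // Placeholder: a single state write recording several finished chunks.
}

async fn checkpoint_with_backpressure(
    finished_chunks: impl IntoIterator<Item = HollowBatch>,
    max_unapplied: usize,
) {
    let mut pending = Vec::new();
    for output in finished_chunks {
        pending.push(output);
        // Flush once enough work has accumulated, turning N per-chunk writes
        // into roughly N / max_unapplied state writes.
        if pending.len() >= max_unapplied {
            apply_batched(&pending).await;
            pending.clear();
        }
    }
    if !pending.is_empty() {
        apply_batched(&pending).await;
    }
}
```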
Checklist
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.