Incremental compaction #32381

Merged

Conversation

DAlperin
Member

@DAlperin DAlperin commented Apr 30, 2025

Towards https://github.com/MaterializeInc/database-issues/issues/9191

Today, we have no good way to split the work of compaction into smaller parts. This presents a problem as datasets and clusters continue to grow in size: if a compaction takes a significant amount of time, there is a risk that the process running it won't live long enough to finish (for whatever reason: failure, shutdown, scheduling, etc.).

This PR aims to improve the situation when compacting many shorter runs. We already split the work into "chunks" based on the size of the runs, but we don't write the results back out to state until all chunks are complete. This is suboptimal: imagine a large compaction is chugging along, 99 of the 100 batches of work are done, and before the last one can finish the cluster shuts down. All of that work is wasted.

This PR "checkpoints" it's work into state after each chunk is done. That way in the example above, only the partially finished 100th chunk is lost. (Incremental work within chunks will be the subject of future work).

There is a tradeoff here, though: it means writing to state more often, which risks putting CRDB under additional load. We currently average 650-750 writes per second to each of our CRDB nodes in us-east-1. There is real potential risk here: in us-east-1, on the order of 200 chunks per second are queued up, so if every chunk completed immediately and concurrently we would add up to ~200 writes per second on top of that baseline, roughly a 25-30% increase in the QPS of our CRDB cluster (I think the cluster can handle it based on the resource usage I'm seeing, but setting that aside...). I don't think every chunk across every environment will complete immediately and concurrently, so the actual impact on QPS is likely to be lower than 200/s. That said, we don't have a sense of per-chunk timing, so it's hard to estimate more precisely. An anecdotal test in staging didn't reveal any undue load.

If this remains a concern, some form of backpressure could be implemented to batch applies.
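
For illustration, batching applies could look roughly like the sketch below: buffer completed chunks and write them to state in groups, trading a little durability granularity for fewer writes. The names and the threshold knob are hypothetical, not the real persist machinery.

```rust
// Hypothetical output of one finished compaction chunk.
struct ChunkOutput(u64);

/// Buffers finished chunks and applies them to state in batches to bound
/// the number of state writes (and thus CRDB load).
struct BatchedApplier {
    pending: Vec<ChunkOutput>,
    max_pending: usize,
}

impl BatchedApplier {
    fn new(max_pending: usize) -> Self {
        Self { pending: Vec::new(), max_pending }
    }

    /// Record a finished chunk; only write to state once enough have queued up.
    fn chunk_done(&mut self, out: ChunkOutput, state: &mut Vec<ChunkOutput>) {
        self.pending.push(out);
        if self.pending.len() >= self.max_pending {
            self.flush(state);
        }
    }

    /// One state write covering every buffered chunk.
    fn flush(&mut self, state: &mut Vec<ChunkOutput>) {
        state.append(&mut self.pending);
    }
}

fn main() {
    let mut state = Vec::new();
    let mut applier = BatchedApplier::new(10);
    for i in 0..25u64 {
        applier.chunk_done(ChunkOutput(i), &mut state);
    }
    applier.flush(&mut state); // flush the tail that never hit the threshold
    assert_eq!(state.len(), 25);
}
```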

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@DAlperin DAlperin marked this pull request as ready for review May 5, 2025 20:03
@DAlperin DAlperin requested a review from a team as a code owner May 5, 2025 20:03
Contributor

@bkirwi bkirwi left a comment


A few minor things, but the overall shape looks good! The compaction stream seems very tidy.

@bkirwi
Contributor

bkirwi commented May 6, 2025

The performance / load analysis you have seems reasonable! If you're adding a compact_all or similar method anyway to clean up the duplicated code, it might be easy and smart to add a flag that lets us disable the new behaviour if we notice performance issues in prod. (We can probably get away without it if that's hard, but I have a feeling it won't be so hard with that additional method in place.)

@DAlperin DAlperin force-pushed the dov/persist-incremental-compaction branch from bc9fabd to 61a7e2c Compare May 6, 2025 19:18
@DAlperin
Member Author

DAlperin commented May 6, 2025

Good call, added a flag to turn it off if need be.
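
(For context, the kill switch boils down to a boolean config consulted before taking the per-chunk apply path. A rough sketch, with a hypothetical flag name and lookup rather than the real config plumbing:)

```rust
use std::collections::HashMap;

// Hypothetical flag name and default; the real wiring lives in persist's config system.
const DEFAULT_INCREMENTAL_COMPACTION: bool = true;

fn incremental_compaction_enabled(overrides: &HashMap<&str, bool>) -> bool {
    overrides
        .get("persist_incremental_compaction")
        .copied()
        .unwrap_or(DEFAULT_INCREMENTAL_COMPACTION)
}

/// How many state writes a compaction of `chunks` chunks would issue,
/// depending on whether the new behaviour is enabled.
fn state_writes_for(chunks: usize, incremental: bool) -> usize {
    if incremental {
        chunks // new path: checkpoint after every chunk
    } else {
        1 // old path: a single apply once everything is done
    }
}

fn main() {
    // An operator flips the flag off after noticing extra CRDB load.
    let overrides = HashMap::from([("persist_incremental_compaction", false)]);
    let writes = state_writes_for(100, incremental_compaction_enabled(&overrides));
    assert_eq!(writes, 1);
}
```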

@DAlperin DAlperin force-pushed the dov/persist-incremental-compaction branch from 61a7e2c to b83b967 Compare May 6, 2025 20:35
@DAlperin DAlperin force-pushed the dov/persist-incremental-compaction branch from b83b967 to 417bc5b Compare May 6, 2025 21:18
Contributor

@bkirwi bkirwi left a comment


Great - thanks for the changes!

Let's run a set of nightlies on this, but otherwise I think you're good to go.

@DAlperin DAlperin merged commit 69e7465 into MaterializeInc:main May 7, 2025
253 checks passed
bkirwi added a commit to bkirwi/materialize that referenced this pull request May 26, 2025
bkirwi added a commit that referenced this pull request May 26, 2025
[persist] Revert "Incremental compaction (#32381)"
ggevay pushed a commit that referenced this pull request May 27, 2025
DAlperin added a commit to DAlperin/materialize that referenced this pull request May 28, 2025
DAlperin added a commit to DAlperin/materialize that referenced this pull request May 28, 2025
DAlperin added a commit to DAlperin/materialize that referenced this pull request May 28, 2025