
Conversation

@annrpom annrpom commented Oct 7, 2025

We will add a configurable option, LatencyTolerantSpanPolicy, that
specifies the minimum value size required for latency tolerant spans
and likely MVCC garbage to be separated into a blob file.

Fixes: #5377

@cockroach-teamcity

This change is Reviewable

@annrpom annrpom force-pushed the tunable-min-size branch 2 times, most recently from c0ab56d to 3e4eb3e Compare October 7, 2025 20:31
@annrpom annrpom changed the title db: add configurable setting for unique sep scenarios db: add configurable LatencyTolerantSpanPolicy option Oct 7, 2025
@annrpom annrpom marked this pull request as ready for review October 8, 2025 15:34
@annrpom annrpom requested a review from a team as a code owner October 8, 2025 15:34
@annrpom annrpom requested a review from sumeerbhola October 8, 2025 15:34

@sumeerbhola sumeerbhola left a comment


Reviewable status: 0 of 17 files reviewed, 1 unresolved discussion


options.go line 822 at r1 (raw file):

		// be separated into a blob file when the value is likely MVCC garbage
		// or a part of the latency tolerant span key range.
		LatencyTolerantSpanPolicy func() compact.ValueSeparationOutputConfig

I am always a bit confused about our configuration approach to value separation.

  • We have a globally applicable ValueSeparationPolicy with many parameters. This is very reasonable.
  • We tweak it in two ways (a) no value separation, (b) the minimum size, based on the ValueStoragePolicy enum in the SpanPolicy. This tweaking happens in code, instead of the SpanPolicy explicitly specifying the pair (doValueSeparation, minimumSize), where both of the elements in that tuple would allow a special value that says "use global setting". In general, mapping things from enum into configs is probably best left to the highest layer (like CockroachDB).

Now we are adding another option that makes that minimum size configurable and it uses a func, when we already have a dynamic ValueSeparationPolicy via Options.Experimental.ValueSeparationPolicy func and we have a dynamic SpanPolicy to override the global behavior via SpanPolicyFunc. I think this is making the configuration more confusing. And we seem to have decided that the minimum size should be the same for latency tolerant and MVCC garbage -- that decision is better left to CockroachDB.

How about something like:

type ValueSeparationPolicy struct {
   ...
   MinimumSize int
   // Applies only to SpanPolicys that permit separation of MVCC garbage, which is also the default.
   MinimumMVCCGarbageSize int
   ...
}

type SpanPolicy struct {
   ...
   ValueStoragePolicy ValueStoragePolicy
}

type ValueStoragePolicy struct {
   PolicyAdjustment ValueStoragePolicyAdjustment
   // Remaining fields are ignored, unless the adjustment is Override.
   HasMVCCGarbage bool
   MinimumSize int
   MinimumMVCCGarbageSize int
}

// Possible values are UseDefault, NoValueSeparation, Override.
type ValueStoragePolicyAdjustment uint8

@jbowens

@RaduBerinde

options.go line 822 at r1 (raw file):

Previously, sumeerbhola wrote…


I agree - Pebble can define a "recommended" low latency value for ValueStoragePolicy that CRDB can use to populate the field - but it should all be returned through the span policy.

@annrpom annrpom changed the title db: add configurable LatencyTolerantSpanPolicy option db: min size for mvcc separation should be configurable Oct 15, 2025
@annrpom annrpom marked this pull request as draft October 15, 2025 21:05
We will add a field, `MinimumMVCCGarbageSize`, that
specifies the minimum value size required for likely MVCC garbage to be
eligible for separation.

Fixes: cockroachdb#5377

@annrpom annrpom left a comment


Reviewable status: 0 of 26 files reviewed, 1 unresolved discussion (waiting on @jbowens, @RaduBerinde, and @sumeerbhola)


options.go line 822 at r1 (raw file):

Previously, RaduBerinde wrote…

I agree - Pebble can define a "recommended" low latency value for ValueStoragePolicy that CRDB can use to populate the field - but it should all be returned through the span policy.

@sumeerbhola done for the most part: what were you thinking for the HasMVCCGarbage field in the ValueStoragePolicy struct? i'm a bit confused by its purpose

@annrpom annrpom requested a review from sumeerbhola October 16, 2025 15:02
@annrpom annrpom marked this pull request as ready for review October 16, 2025 15:02

@sumeerbhola sumeerbhola left a comment


@sumeerbhola reviewed 1 of 26 files at r2, 2 of 5 files at r3.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @jbowens and @RaduBerinde)


options.go line 822 at r1 (raw file):

Previously, annrpom (annie pompa) wrote…

@sumeerbhola done for the most part: what were you thinking for the HasMVCCGarbage field in the ValueStoragePolicy struct? i'm a bit confused by its purpose

Perhaps I should have called it IsMVCCKeys. It would be used to decide whether we consider multiple versioned keys to be MVCC and so can separate the garbage. I suspect we haven't needed it for the lock table keys since we say "NoValueSeparation" for them.


value_separation.go line 25 at r3 (raw file):

// will be separated into a blob file when the value storage policy is
// Override.
const latencyTolerantMinimumSize = 10

where is this used, outside of testing?
We should consider making this an exported const with comments that this is what we recommend users should use when constructing the ValueSeparationPolicy or ValueStoragePolicy structs.


value_separation.go line 45 at r3 (raw file):

	policy := d.opts.Experimental.ValueSeparationPolicy()
	if !policy.Enabled || valueStorage.PolicyAdjustment == NoValueSeparation {
		return compact.NeverSeparateValues{}

I am getting quite confused with determineCompactionValueSeparation. If there is a single policy for the whole compaction we will call it with that ValueStoragePolicy. In that sense deciding early on compact.NeverSeparateValues{} is probably worthwhile, though it seems conceptually an optimization, since compaction.go also does the following:

		case ValueStorageLowReadLatency:
			vSep = compact.NeverSeparateValues{}

When there isn't a single policy, we used to pass ValueStorageDefault. Now we have a struct. The calling code of determine... seems unchanged

pebble/compaction.go, lines 3448 to 3456 in df921f1:

	spanPolicy, spanPolicyEndKey, err = d.opts.Experimental.SpanPolicyFunc(c.bounds.Start)
	if err != nil {
		return runner.Finish().WithError(err)
	}
	if len(spanPolicyEndKey) == 0 || d.cmp(c.bounds.End.Key, spanPolicyEndKey) < 0 {
		compactionValueStoragePolicy = spanPolicy.ValueStoragePolicy
	}
	valueSeparation := c.getValueSeparation(jobID, c, compactionValueStoragePolicy)

So we would pass a struct that says PolicyAdjustment=UseDefault but the other fields are 0? The other fields ought to be the default fields. Btw, we could use some unit testing of the many-policies-in-one-compaction case to ensure it is working as intended.

This also made me wonder why the MinimumSize is interesting enough to initialize now. We could remember the global minimum sizes and decide what to set them to when SetNextOutputConfig is called, which should always be called in the Override case (and the code seems to do that now).

Then I noticed preserveBlobReferences doesn't do anything in SetNextOutputConfig, so the preserveBlobReferences.MinimumSize return value is quite arbitrary, which we then fill into the sstable properties tw.SetValueSeparationProps. Of course, if we did do the override using SetNextOutputConfig it would still reflect the policy at the time of compaction and not what was in effect when the blob file was originally written. This seems incorrect.

And then we have the logic in shouldWriteBlobFiles which actually compares these using

		if backingProps.ValueSeparationMinSize != minimumValueSizeForCompaction {
			return true, 0
		}

This seems dubious since there could be many policies spanning the compaction interval and we are looking at one policy (perhaps the default); if not everything matches we would rewrite the references?

@jbowens

@sumeerbhola sumeerbhola requested a review from jbowens October 20, 2025 18:19

@jbowens jbowens left a comment


Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @RaduBerinde, and @sumeerbhola)


options.go line 822 at r1 (raw file):

Previously, sumeerbhola wrote…

Perhaps I should have called it IsMVCCKeys. It would be used to decide whether we consider multiple versioned keys to be MVCC and so can separate the garbage. I suspect we haven't needed it for the lock table keys since we say "NoValueSeparation" for them.

At the SpanPolicy level we currently have DisableValueSeparationBySuffix:

	// DisableValueSeparationBySuffix disables discriminating KVs depending on
	// suffix.
	//
	// Among a set of keys with the same prefix, Pebble's default heuristics
	// optimize access to the KV with the smallest suffix. This is useful for MVCC
	// keys (where the smallest suffix is the latest version), but should be
	// disabled for keys where the suffix does not correspond to a version.
	DisableValueSeparationBySuffix bool

Are you suggesting moving it into the ValueStoragePolicy? I think it's currently at the higher level because it also configures whether to separate values into value blocks within the same sstable. I don't really have an opinion where it lives.


value_separation.go line 45 at r3 (raw file):

When there isn't a single policy, we used to pass ValueStorageDefault. Now we have a struct. The calling code of determine... seems unchanged

I didn't follow this; I didn't see any changes from enum to struct.

This seems dubious since there could be many policies spanning the compaction interval and we are looking at one policy (perhaps the default) and it not everything matches we would rewrite the references?

We split sstables at new policies during Open, so we expect the vast majority of compactions to be within the same span policy. We establish the value separation of the entire compaction span upfront and pass that in:

	spanPolicy, spanPolicyEndKey, err = d.opts.Experimental.SpanPolicyFunc(c.bounds.Start)
	if err != nil {
		return runner.Finish().WithError(err)
	}
	if len(spanPolicyEndKey) == 0 || d.cmp(c.bounds.End.Key, spanPolicyEndKey) < 0 {
		compactionValueStoragePolicy = spanPolicy.ValueStoragePolicy
	}

This allows us to rewrite blob files if any of the inputs were encoded using a different policy.

(still parsing the comments here)


@jbowens jbowens left a comment


Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, @RaduBerinde, and @sumeerbhola)


value_separation.go line 45 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…


Then I noticed preserveBlobReferences doesn't do anything in SetNextOutputConfig, so the preserveBlobReferences.MinimumSize return value is quite arbitrary, which we then fill into the sstable properties tw.SetValueSeparationProps.

The existing logic looks right to me. We'll only preserve blob references if all the compaction inputs used the same minimum size for the purpose of separating values. So this propagated minimum size holds for the outputted sstables as well.

However by making this additional value size configurable, the policy can no longer be expressed by just 1 minimum size. We now also have to capture the minimum size of MVCC garbage at the time it was written.


@jbowens jbowens left a comment


Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, @RaduBerinde, and @sumeerbhola)


value_separation.go line 45 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…


One thing to keep in mind: I don't think we should go too hard on configuration cleanup in this PR, because our intent is to backport the knob to 25.4 (because our motivation for adding it in the first place was to be able to temper the behavior if we see issues in the value separation rollout).


@sumeerbhola sumeerbhola left a comment


Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)


options.go line 822 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…


Good point. I don't have an opinion either. Just wanted it captured somehow, and we already do.
Do we use it in blob value separation, or are we relying on NoValueSeparation?


value_separation.go line 45 at r3 (raw file):

I didn't follow this; I didn't see any changes from enum to struct.

ValueStoragePolicy used to be an enum and now it is a struct. I may be missing something.

We split sstables at new policies during Open, so we expect the vast majority of compactions to be within the same span policy

Yep, I understand that part (mainly because I had to heavily modify this code for the tiered storage prototype).

The existing logic looks right to me. We'll only preserve blob references if all the compaction inputs used the same minimum size for the purpose of separating values.

This seems risky since if we happen to have files of 2 policies in a compaction, we will unnecessarily rewrite the values. That is if [a,c) has min-size 100 and [c,e) has min-size 500, and we have a compaction over the interval [b,d) we would rewrite the blob references. Why don't we check whether the minimum size is correct by iterating over the SpanPolicys in the compaction and coordinate that iteration with that of the input files?

I suspect what you are saying is that because every level splits at these policy boundaries, we should never be expanding a compaction to cover multiple policies (unless the policy changed)? Am I understanding correctly? But we also add manual compactions, which aren't necessarily working with such constrained spans. I think if we are relying on some assumptions it would be nice to spell them out.

Also maybe worth taking a look at value_separation.go, SpanPolicy, SpanPolicyFunc, and DB.compactAndWrite -- there is iteration over the policies that happens for other reasons. It may help refine this in a way that will not need to be rethought when we add the TieringPolicy.

One thing to keep in mind, I don't think we should go too hard on configuration clean up in this PR, because our intent is to backport the knob to 25.4

Oh, I wasn't aware of that. In that case I'd prefer for @annrpom to do a small PR that you review that then gets backported. We can come back to this larger one after that.


xinhaoz commented Oct 29, 2025

value_separation.go line 45 at r3 (raw file):

Previously, sumeerbhola wrote…


Synced with Annie after I happened to see the discussion here, and I applied some of the suggestions in #5518 since I just merged some of the changes in question last week.
We can move review/discussion of the preserving logic there!


Development

Successfully merging this pull request may close these issues.

db: make minimum bytes eligible for likely mvcc garbage separation configurable

6 participants