-
Notifications
You must be signed in to change notification settings - Fork 514
db: min size for mvcc separation should be configurable #5410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
c0ab56d to
3e4eb3e
Compare
3e4eb3e to
1c0a72c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 17 files reviewed, 1 unresolved discussion
options.go line 822 at r1 (raw file):
// be separated into a blob file when the value is likely MVCC garbage // or a part of the latency tolerant span key range. LatencyTolerantSpanPolicy func() compact.ValueSeparationOutputConfig
I am always a bit confused about our configuration approach to value separation.
- We have a globally applicable ValueSeparationPolicy with many parameters. This is very reasonable.
- We tweak it in two ways (a) no value separation, (b) the minimum size, based on the ValueStoragePolicy enum in the SpanPolicy. This tweaking happens in code, instead of the SpanPolicy explicitly specifying the pair (doValueSeparation, minimumSize), where both of the elements in that tuple would allow a special value that says "use global setting". In general, mapping things from enum into configs is probably best left to the highest layer (like CockroachDB).
Now we are adding another option that makes that minimum size configurable and it uses a func, when we already have a dynamic ValueSeparationPolicy via Options.Experimental.ValueSeparationPolicy func and we have a dynamic SpanPolicy to override the global behavior via SpanPolicyFunc. I think this is making the configuration more confusing. And we seem to have decided that the minimum size should be the same for latency tolerant and MVCC garbage -- that decision is better left to CockroachDB.
How about something like:
type ValueSeparationPolicy struct {
...
MinimumSize int
// Applies only to SpanPolicys that permit separation of MVCC garbage, which is also the default.
MinimumMVCCGarbageSize int
...
}
type SpanPolicy struct {
...
ValueStoragePolicy ValueStoragePolicy
}
type ValueStoragePolicy {
PolicyAdjustment ValueStoragePolicyAdjustment
// Remaining fields are ignored, unless the adjustment is Override.
HasMVCCGarbage bool
MinimumSize int
MinimumMVCCGarbageSize int.
}
// Possibe values are UseDefault, NoValueSeparation, Override.
type ValueStoragePolicyAdjustment uint8
|
Previously, sumeerbhola wrote…
I agree - Pebble can define a "recommended" low latency value for |
1c0a72c to
dd65df3
Compare
dd65df3 to
653b029
Compare
We will add a field, `MinimumMVCCGarbageSize`, that specifies the minimum value size required for likely MVCC garbage to be eligible for separation. Fixes: cockroachdb#5377
653b029 to
7e24055
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 26 files reviewed, 1 unresolved discussion (waiting on @jbowens, @RaduBerinde, and @sumeerbhola)
options.go line 822 at r1 (raw file):
Previously, RaduBerinde wrote…
I agree - Pebble can define a "recommended" low latency value for
ValueStoragePolicythat CRDB can use to populate the field - but it should all be returned through the span policy.
@sumeerbhola done for the most part: what were you thinking for the HasMVCCGarbage field in the ValueStoragePolicy struct? i'm a bit confused by its purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumeerbhola reviewed 1 of 26 files at r2, 2 of 5 files at r3.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @jbowens and @RaduBerinde)
options.go line 822 at r1 (raw file):
Previously, annrpom (annie pompa) wrote…
@sumeerbhola done for the most part: what were you thinking for the
HasMVCCGarbagefield in theValueStoragePolicystruct? i'm a bit confused by its purpose
Perhaps I should have called IsMVCCKeys. It would be used to decided whether we consider multiple versioned keys to be MVCC and so can separate the garbage. I suspect we haven't needed it for the lock table keys since we say "NoValueSeparation" for them.
value_separation.go line 25 at r3 (raw file):
// will be separated into a blob file when the value storage policy is // Override. const latencyTolerantMinimumSize = 10
where is this used, outside of testing?
We should consider making this an exported const with comments that this is what we recommend users should use when constructing the ValueSeparationPolicy or ValueStoragePolicy structs.
value_separation.go line 45 at r3 (raw file):
policy := d.opts.Experimental.ValueSeparationPolicy() if !policy.Enabled || valueStorage.PolicyAdjustment == NoValueSeparation { return compact.NeverSeparateValues{}
I am getting quite confused with determineCompactionValueSeparation. If there is a single policy for the whole compaction we will call it with that ValueStoragePolicy. In that sense deciding early on compact.NeverSeparateValues{} is probably worthwhile, though seems conceptually an optimization, since compaction.go also does the following:
case ValueStorageLowReadLatency:
vSep = compact.NeverSeparateValues{}
When there isn't a single policy, we used to pass ValueStorageDefault. Now we have a struct. The calling code of determine... seems unchanged
Lines 3448 to 3456 in df921f1
| spanPolicy, spanPolicyEndKey, err = d.opts.Experimental.SpanPolicyFunc(c.bounds.Start) | |
| if err != nil { | |
| return runner.Finish().WithError(err) | |
| } | |
| if len(spanPolicyEndKey) == 0 || d.cmp(c.bounds.End.Key, spanPolicyEndKey) < 0 { | |
| compactionValueStoragePolicy = spanPolicy.ValueStoragePolicy | |
| } | |
| valueSeparation := c.getValueSeparation(jobID, c, compactionValueStoragePolicy) |
So we would pass a struct that says PolicyAdjustment=UseDefault but the other fields are 0? btw, we could use some unit testing of this many policies in the compaction case to ensure it is working as intended.
The other fields ought to be the default fields.
This also made me wonder why is the MinimumSize interesting enough to initialize now. We could remember the global minimum sizes and decide what to set to when SetNextOutputConfig is called. Which should be always called in the Override case (and the code seems to do that now).
Then I noticed preserveBlobReferences doesn't do anything in SetNextOutputConfig, so the preserveBlobReferences.MinimumSize return value is quite arbitrary, which we then fill into the sstable properties tw.SetValueSeparationProps. Of course, if we did do the override using SetNextOutputConfig it would still reflect the policy at the time of compaction and not what was in effect when the blob file was originally written. This seems incorrect.
And then we have the logic in shouldWriteBlobFiles which actually compares these using
if backingProps.ValueSeparationMinSize != minimumValueSizeForCompaction {
return true, 0
}
This seems dubious since there could be many policies spanning the compaction interval and we are looking at one policy (perhaps the default) and it not everything matches we would rewrite the references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @RaduBerinde, and @sumeerbhola)
options.go line 822 at r1 (raw file):
Previously, sumeerbhola wrote…
Perhaps I should have called
IsMVCCKeys. It would be used to decided whether we consider multiple versioned keys to be MVCC and so can separate the garbage. I suspect we haven't needed it for the lock table keys since we say "NoValueSeparation" for them.
At the SpanPolicy level we currently have DisableValueSeparationBySuffix:
// DisableValueSeparationBySuffix disables discriminating KVs depending on
// suffix.
//
// Among a set of keys with the same prefix, Pebble's default heuristics
// optimize access to the KV with the smallest suffix. This is useful for MVCC
// keys (where the smallest suffix is the latest version), but should be
// disabled for keys where the suffix does not correspond to a version.
DisableValueSeparationBySuffix bool
Are you suggesting moving it into the ValueStoragePolicy? I think it's currently at the higher level because it also configures whether to separate values into value blocks within the same sstable. I don't really have an opinion where it lives.
value_separation.go line 45 at r3 (raw file):
When there isn't a single policy, we used to pass
ValueStorageDefault. Now we have a struct. The calling code ofdetermine...seems unchanged
I didn't follow this; I didn't see any changes from enum to struct.
This seems dubious since there could be many policies spanning the compaction interval and we are looking at one policy (perhaps the default) and it not everything matches we would rewrite the references?
We split sstables at new policies during Open, so we expect the vast majority of compactions to be within the same span policy. We establish the value separation of the entire compaction span upfront and pass that in:
spanPolicy, spanPolicyEndKey, err = d.opts.Experimental.SpanPolicyFunc(c.bounds.Start)
if err != nil {
return runner.Finish().WithError(err)
}
if len(spanPolicyEndKey) == 0 || d.cmp(c.bounds.End.Key, spanPolicyEndKey) < 0 {
compactionValueStoragePolicy = spanPolicy.ValueStoragePolicy
}
This allows us to rewrite blob files if any of the inputs were encoded using a different policy.
(still parsing the comments here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, @RaduBerinde, and @sumeerbhola)
value_separation.go line 45 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
When there isn't a single policy, we used to pass
ValueStorageDefault. Now we have a struct. The calling code ofdetermine...seems unchangedI didn't follow this; I didn't see any changes from enum to struct.
This seems dubious since there could be many policies spanning the compaction interval and we are looking at one policy (perhaps the default) and it not everything matches we would rewrite the references?
We split sstables at new policies during Open, so we expect the vast majority of compactions to be within the same span policy. We establish the value separation of the entire compaction span upfront and pass that in:
spanPolicy, spanPolicyEndKey, err = d.opts.Experimental.SpanPolicyFunc(c.bounds.Start) if err != nil { return runner.Finish().WithError(err) } if len(spanPolicyEndKey) == 0 || d.cmp(c.bounds.End.Key, spanPolicyEndKey) < 0 { compactionValueStoragePolicy = spanPolicy.ValueStoragePolicy }This allows us to rewrite blob files if any of the inputs were encoded using a different policy.
(still parsing the comments here)
Then I noticed
preserveBlobReferencesdoesn't do anything inSetNextOutputConfig, so thepreserveBlobReferences.MinimumSizereturn value is quite arbitrary, which we then fill into the sstable propertiestw.SetValueSeparationProps.
The existing logic looks right to me. We'll only preserve blob references if all the compaction inputs used the same minimum size for the purpose of separating values. So this propagated minimum size holds for the outputted sstables as well.
However by making this additional value size configurable, the policy can no longer be expressed by just 1 minimum size. We now also have to capture the minimum size of MVCC garbage at the time it was written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, @RaduBerinde, and @sumeerbhola)
value_separation.go line 45 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Then I noticed
preserveBlobReferencesdoesn't do anything inSetNextOutputConfig, so thepreserveBlobReferences.MinimumSizereturn value is quite arbitrary, which we then fill into the sstable propertiestw.SetValueSeparationProps.The existing logic looks right to me. We'll only preserve blob references if all the compaction inputs used the same minimum size for the purpose of separating values. So this propagated minimum size holds for the outputted sstables as well.
However by making this additional value size configurable, the policy can no longer be expressed by just 1 minimum size. We now also have to capture the minimum size of MVCC garbage at the time it was written.
One thing to keep in mind, I don't think we should go too hard on configuration clean up in this PR, because our intent is to backport the knob to 25.4 (bc our motivation for adding it in the first place was to be able temper the behavior if we see issues in the value separation rollout).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 26 files reviewed, 3 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde)
options.go line 822 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
At the SpanPolicy level we currently have
DisableValueSeparationBySuffix:// DisableValueSeparationBySuffix disables discriminating KVs depending on // suffix. // // Among a set of keys with the same prefix, Pebble's default heuristics // optimize access to the KV with the smallest suffix. This is useful for MVCC // keys (where the smallest suffix is the latest version), but should be // disabled for keys where the suffix does not correspond to a version. DisableValueSeparationBySuffix boolAre you suggesting moving it into the
ValueStoragePolicy? I think it's currently at the higher level because it also configures whether to separate values into value blocks within the same sstable. I don't really have an opinion where it lives.
Good point. I don't have an opinion either. Just wanted it captured somehow, and we already do.
Do we use it in blob value separation, or are we relying on NoValueSeparation?
value_separation.go line 45 at r3 (raw file):
I didn't follow this; I didn't see any changes from enum to struct.
ValueStoragePolicy used to be an enum and now it is a struct. I may be missing something.
We split sstables at new policies during Open, so we expect the vast majority of compactions to be within the same span policy
Yep, I understand that part (mainly because I had to heavily modify this code for the tiered storage prototype).
The existing logic looks right to me. We'll only preserve blob references if all the compaction inputs used the same minimum size for the purpose of separating values.
This seems risky since if we happen to have files of 2 policies in a compaction, we will unnecessarily rewrite the values. That is if [a,c) has min-size 100 and [c,e) has min-size 500, and we have a compaction over the interval [b,d) we would rewrite the blob references. Why don't we check whether the minimum size is correct by iterating over the SpanPolicys in the compaction and coordinate that iteration with that of the input files?
I suspect what you are saying is that because every level splits at these policy boundaries, we should never be expanding a compaction to cover multiple policies (unless the policy changed)? Am I understanding correctly? But we also add manual compactions, which aren't necessarily working with such constrained spans. I think if we are relying on some assumptions it would be nice to spell them out.
Also maybe worth taking a look at the value_separation.go, SpanPolicy, SpanPolicyFunc and DB.compactAndWrite -- there is iteration over the policies that happens for other reasons. It may help refining this in a way that will not need to be rethought when we add the TieringPolicy.
One thing to keep in mind, I don't think we should go too hard on configuration clean up in this PR, because our intent is to backport the knob to 25.4
Oh, I wasn't aware of that. In that case I'd prefer for @annrpom to do a small PR that you review that then gets backported. We can come back to this larger one after that.
|
Previously, sumeerbhola wrote…
Synced with Annie after I happened to see the discussion here, and I applied some of the suggestions in #5518 since I just merged some of the changes in question last week. |
We will add a configurable option,
LatencyTolerantSpanPolicy, thatspecifies the minimum value size required for latency tolerant spans
and likely MVCC garbage to be separated into a blob file.
Fixes: #5377