sstable,db: introduce a sstable-internal ObsoleteBit in the key kind #2559
sumeerbhola merged 1 commit into cockroachdb:master from
Conversation
jbowens left a comment
flushing some comments
Reviewed 2 of 37 files at r1.
Reviewable status: 2 of 37 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
-- commits line 22 at r1:
Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?
I think if we wanted to support them, we could use add a PREMERGE key kind that holds the pre-merged operands and write both the PREMERGE (live) key and the individual MERGE (obsolete bit set) operands (including for the most recent operand). A sstable iterator would be required to return either only the contained PREMERGEs (if the iterator's seqnum is higher than the sstable's LargestSeqNum) or only the contained MERGEs (if not). PREMERGE would otherwise behave just like MERGE, (or the sstable iterator could transparently edit the kind to MERGE before surfacing).
I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.
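To make the proposed rule concrete, a tiny hypothetical sketch; PREMERGE is not an existing Pebble key kind, and every name below is illustrative:

package main

import "fmt"

// surfacePremergedOperand reports whether an sstable iterator should surface
// the single PREMERGE key (true) or the individual, obsolete-marked MERGE
// operands (false), based on whether the iterator can read every write in the
// sstable.
func surfacePremergedOperand(iterSeqNum, sstLargestSeqNum uint64) bool {
	return iterSeqNum > sstLargestSeqNum
}

func main() {
	fmt.Println(surfacePremergedOperand(100, 50)) // true: PREMERGE only
	fmt.Println(surfacePremergedOperand(40, 50))  // false: MERGE operands only
}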
-- commits line 97 at r1:
nice
sstable/block.go line 893 at r1 (raw file):
} else {
	// k >= search key, so prune everything after index (since index
	// satisfies the property we are looking for).
thanks for adding these comments
sstable/block.go line 1078 at r1 (raw file):
trailer := binary.LittleEndian.Uint64(i.key[n:])
hiddenPoint = i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0)
nit: curious about the hiddenPoint variable and its scope; could this be
if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
goto start
}
sstable/block.go line 1365 at r1 (raw file):
if prefixChanged || i.cmp(i.ikey.UserKey, succKey) >= 0 {
	// Prefix has changed.
	if hiddenPoint {
is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?
sstable/block.go line 1404 at r1 (raw file):
trailer := binary.LittleEndian.Uint64(i.key[n:])
hiddenPoint = i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0)
nit similar question about hiddenPoint and its scope
sumeerbhola left a comment
TFTR!
I'll start working on cleaning up the various TODOs
Reviewable status: 2 of 37 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)
-- commits line 22 at r1:
Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.
Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?
I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state-machine operations (read-modify-write, like counters), and one could optimize the application for some subset of those by avoiding the read altogether (e.g. a counter) and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint), and avoiding the read at the "local" Pebble layer does not factor into the costs.
I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.
I don't think we need that to be future proof. We can easily do something like the following sequence:
- Support strict and only write strict ssts for the disagg storage key space.
- New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.
I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.
sstable/block.go line 1078 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: curious about the hiddenPoint variable and its scope; could this be
if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
	goto start
}
Good observation. It can be in the scope of the if-block. I played around with (a) your suggestion and (b) goto in the same place (but hiddenPoint declared in the scope of the if-block). (a) is slightly slower for BenchmarkBlockIter* -- maybe it disrupts the instruction pipeline due to misprediction since the goto path never gets taken. (b) is the same as the current code. So I've changed to (b)
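To make (b) concrete, a minimal self-contained sketch of the pattern: hiddenPoint stays scoped to the decode block, the obsolete bit is stripped so it never escapes, and the goto sits where the original if hiddenPoint check was. Only hideObsoletePoints and the idea of trailerObsoleteBit come from this PR; the bit value, types, and entry encoding below are illustrative.

package main

import (
	"encoding/binary"
	"fmt"
)

const trailerObsoleteBit uint64 = 1 << 6 // illustrative bit value

type iterSketch struct {
	entries            [][]byte // each entry: user key followed by an 8-byte little-endian trailer
	pos                int
	hideObsoletePoints bool
	userKey            []byte
}

func (i *iterSketch) next() bool {
start:
	if i.pos >= len(i.entries) {
		return false
	}
	key := i.entries[i.pos]
	i.pos++
	n := len(key) - 8
	trailer := binary.LittleEndian.Uint64(key[n:])
	// (b): hiddenPoint is scoped to this block; the goto stays where the
	// original `if hiddenPoint` check was, after the entry is fully decoded.
	hiddenPoint := i.hideObsoletePoints && trailer&trailerObsoleteBit != 0
	trailer &^= trailerObsoleteBit // strip the bit so it never leaks to callers
	i.userKey = key[:n]
	_ = trailer // a real iterator would materialize seqnum/kind from this
	if hiddenPoint {
		goto start
	}
	return true
}

func main() {
	mk := func(k string, obsolete bool) []byte {
		b := append([]byte(k), make([]byte, 8)...)
		var t uint64
		if obsolete {
			t = trailerObsoleteBit
		}
		binary.LittleEndian.PutUint64(b[len(k):], t)
		return b
	}
	it := &iterSketch{
		entries:            [][]byte{mk("a", false), mk("a", true), mk("b", false)},
		hideObsoletePoints: true,
	}
	for it.next() {
		fmt.Printf("%s\n", it.userKey) // prints "a" then "b"; the obsolete "a" is skipped
	}
}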
sstable/block.go line 1365 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?
It can be a hidden point due to rangedels.
sstable/block.go line 1404 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit similar question about hiddenPoint and its scope
Done
Force-pushed from 301da5b to 357cda7
I've marked this as ready for review since much of the cleanup is done. My next steps:
I'll ping when these steps are done.
Force-pushed from f632070 to 38f33c8
sumeerbhola left a comment
- Add more unit tests.
  Tests are ready.
- Investigate the correctness TODO in compaction_iter.go.
  Done.
This is ready for a full review.
Reviewed 13 of 37 files at r1, 1 of 1 files at r2.
Reviewable status: 4 of 51 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)
jbowens left a comment
Reviewed 16 of 37 files at r1, 17 of 21 files at r3, 16 of 16 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
Previously, sumeerbhola wrote…
Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.
Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?
I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state-machine operations (read-modify-write, like counters), and one could optimize the application for some subset of those by avoiding the read altogether (e.g. a counter) and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint), and avoiding the read at the "local" Pebble layer does not factor into the costs.
I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.
I don't think we need that to be future proof. We can easily do something like the following sequence:
- Support strict and only write strict ssts for the disagg storage key space.
- New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.
I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.
Sounds good
compaction_iter.go line 223 at r5 (raw file):
// sameStripeNonSkippable occurs due to RANGEDELs that sort before
// SET/MERGE/DEL with the same seqnum, so the RANGEDEL does not necessarily
// delete the subsequent SET/MERGE/DEL keys.
I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable.
IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.
internal/base/internal.go line 129 at r5 (raw file):
// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax
const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1)
nice
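For readers unfamiliar with the trick: converting a negative untyped constant to uint is a compile-time error, so this declaration builds only when the obsolete bit value is strictly greater than InternalKeyKindMax. A standalone sketch of the same pattern, with made-up values:

package main

// If a <= b, the constant expression a-b-1 is negative, the conversion to
// uint fails to compile, and the build breaks, which is the assertion.
const (
	a = 66 // stand-in for the obsolete bit value
	b = 24 // stand-in for InternalKeyKindMax
)

const _ = uint(a - b - 1)

func main() {}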
sstable/block.go line 1365 at r1 (raw file):
Previously, sumeerbhola wrote…
It can be a hidden point due to rangedels.
ah, right!
sstable/block.go line 94 at r5 (raw file):
n := len(w.curKey) - base.InternalTrailerLen
if n < 0 {
	panic("corrupt key in blockWriter buffer")
nit: we could update this to use errors.AssertionFailedf
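A sketch of what that suggestion might look like, assuming the surrounding code keeps roughly this shape; the helper name is illustrative, and errors.AssertionFailedf is from github.com/cockroachdb/errors:

package main

import "github.com/cockroachdb/errors"

// trailerStart is an illustrative stand-in for the blockWriter code path: an
// assertion-failure error carries a stack trace and is recognized upstream as
// a programming error, unlike a plain string panic.
func trailerStart(curKey []byte, trailerLen int) int {
	n := len(curKey) - trailerLen
	if n < 0 {
		panic(errors.AssertionFailedf("corrupt key in blockWriter buffer"))
	}
	return n
}

func main() {
	_ = trailerStart(make([]byte, 16), 8)
}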
sstable/block.go line 633 at r5 (raw file):
if n := len(firstKey) - 8; n >= 0 {
	// NB: we do not track whether the firstKey is obsolete since the trailer
	// of the firstKey is not used.
should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?
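A minimal sketch of that change, assuming the encoded first key is a user key followed by an 8-byte trailer (the helper name is illustrative):

package main

import "fmt"

// firstUserKey keeps only the user-key prefix of the block's first key and
// never decodes its trailer, since the trailer is unused.
func firstUserKey(firstKey []byte) []byte {
	if n := len(firstKey) - 8; n >= 0 {
		return firstKey[:n:n]
	}
	return nil // too short to contain a trailer; callers treat this as corruption
}

func main() {
	fmt.Printf("%q\n", firstUserKey(append([]byte("apple"), make([]byte, 8)...)))
}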
sstable/block.go line 708 at r5 (raw file):
var k InternalKey
if n := len(s) - 8; n >= 0 {
	k.Trailer = binary.LittleEndian.Uint64(s[n:])
this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.
index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).
if you think it's worth an issue, i'll file one.
sstable/block.go line 949 at r5 (raw file):
// was hidden. Our current assumption is that if there are large numbers
// of hidden keys we will be able to skip whole blocks (using block
// property filters) so we don't bother optimizing.
one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.
sstable/suffix_rewriter.go line 69 at r5 (raw file):
// sstable to be rewritten. Validating the former condition is non-trivial
// when rewriting blocks in parallel, so we could encode the pre-existing
// condition in a NumObsoletePoints property.
nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.
sstable/suffix_rewriter.go line 455 at r5 (raw file):
// suffix. This is significantly slower than the parallelized rewriter, and does
// more work to rederive filters, props, etc, however re-doing that work makes
// it less restrictive -- props no longer need to ??
yeah, idk either
Force-pushed from 38f33c8 to 152d19a
sumeerbhola left a comment
TFTR!
Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions (waiting on @itsbilal and @jbowens)
Previously, jbowens (Jackson Owens) wrote…
Sounds good
for the review record, the idea is recorded in #2465 (comment)
compaction_iter.go line 223 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable. IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.
Ack. I didn't realize that PR existed.
sstable/block.go line 94 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: we could update this to use errors.AssertionFailedf
Done
sstable/block.go line 633 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?
Done
sstable/block.go line 708 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.
index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).
if you think it's worth an issue, i'll file one.
Good idea. Definitely worth an issue.
sstable/block.go line 949 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.
I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.
sstable/suffix_rewriter.go line 69 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.
Good point. Updated the comment.
sstable/suffix_rewriter.go line 455 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
yeah, idk either
I dropped the sentence and added a todo earlier in the file
// TODO(sumeer): document limitations, if any, due to this limited
// re-computation of properties (is there any loss of fidelity?).
This bit marks keys that are obsolete because they are not the newest
seqnum for a user key (in that sstable), or they are masked by a
RANGEDEL.
Setting the obsolete bit on point keys is advanced usage, so we support 2
modes, both of which must be truthful when setting the obsolete bit, but
vary in when they don't set the obsolete bit.
- Non-strict: In this mode, the bit does not need to be set for keys that
are obsolete. Additionally, any sstable containing MERGE keys can only
use this mode. An iterator over such an sstable, when configured to
hideObsoletePoints, can expose multiple internal keys per user key, and
can expose keys that are deleted by rangedels in the same sstable. This
is the mode that non-advanced users should use. Pebble without
disaggregated storage will also use this mode and will best-effort set
the obsolete bit, to optimize iteration when snapshots have retained many
obsolete keys.
- Strict: In this mode, every obsolete key must have the obsolete bit set,
and no MERGE keys are permitted. An iterator over such an sstable, when
configured to hideObsoletePoints satisfies two properties:
- S1: will expose at most one internal key per user key, which is the
most recent one.
- S2: will never expose keys that are deleted by rangedels in the same
sstable.
This is the mode for two use cases in disaggregated storage (which will
exclude parts of the key space that have MERGEs), for levels that contain
sstables that can become foreign sstables.
- Pebble compaction output to these levels that can become foreign
sstables.
- CockroachDB ingest operations that can ingest into the levels that can
become foreign sstables. Note, these are not sstables corresponding to
copied data for CockroachDB range snapshots. This case occurs for
operations like index backfills: these trivially satisfy the strictness
criteria since they only write one key per userkey.
The strictness of the sstable is written to the Properties block.
The Writer implementation discovers keys that are obsolete because they
are the same userkey as the previous key. This can be cheaply done since
we already do user key comparisons in the Writer. For keys obsoleted by
RANGEDELs, the Writer relies on the caller.
On the read path, the obsolete bit is removed by the blockIter. Since
everything reading an sstable uses a blockIter, this prevents any leakage
of this bit. Some effort was made to reduce the regression on the
iteration path, but TableIterNext has +5.84% regression. Some of the
slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster).
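For readers, a tiny self-contained sketch of the set-on-write / check-and-strip-on-read flow described above; the constants and helper are illustrative (the real ones live in internal/base and sstable):

package main

import "fmt"

const (
	keyKindSet         = 1
	obsoleteBit uint64 = 64 // a kind-byte bit above every key kind, per the assert in internal.go
)

// makeTrailer packs a seqnum and kind the way InternalKey trailers are packed.
func makeTrailer(seqNum, kind uint64) uint64 { return seqNum<<8 | kind }

func main() {
	// Write side: the Writer (or the caller, for RANGEDEL-shadowed keys)
	// sets the bit in the encoded trailer of an obsolete point key.
	trailer := makeTrailer(7, keyKindSet) | obsoleteBit

	// Read side (blockIter): decide whether to hide the point, then strip the
	// bit so nothing outside the sstable package ever observes it.
	hideObsoletePoints := true
	hidden := hideObsoletePoints && trailer&obsoleteBit != 0
	trailer &^= obsoleteBit

	fmt.Println(hidden, trailer>>8, trailer&0xff) // true 7 1
}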
old is master:
name old time/op new time/op delta
BlockIterSeekGE/restart=16-16 474ns ± 1% 450ns ± 1% -5.16% (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16 520ns ± 0% 526ns ± 0% +1.20% (p=0.000 n=10+10)
BlockIterNext/restart=16-16 19.3ns ± 1% 21.0ns ± 0% +8.76% (p=0.000 n=10+10)
BlockIterPrev/restart=16-16 38.7ns ± 1% 39.9ns ± 0% +3.20% (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16 1.65µs ± 1% 1.61µs ± 3% -2.24% (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16 1.67µs ± 3% 1.58µs ± 3% -5.11% (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16 1.75µs ± 3% 1.68µs ± 2% -4.14% (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16 1.74µs ± 3% 1.69µs ± 3% -2.54% (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16 23.9ns ± 1% 25.3ns ± 0% +5.84% (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16 23.9ns ± 1% 25.3ns ± 0% +5.78% (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16 45.2ns ± 1% 46.2ns ± 1% +2.09% (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16 45.3ns ± 0% 46.3ns ± 0% +2.23% (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16 51.7ns ± 1% 55.2ns ± 4% +6.82% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16 54.9ns ± 1% 56.4ns ± 3% +2.73% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16 35.0ns ± 1% 34.8ns ± 1% -0.56% (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16 37.8ns ± 0% 38.0ns ± 1% +0.55% (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16 41.5ns ± 2% 42.4ns ± 1% +2.18% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16 94.7ns ± 4% 97.0ns ± 8% ~ (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16 35.4ns ± 2% 36.5ns ± 1% +2.97% (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16 60.1ns ± 1% 57.8ns ± 0% -3.84% (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16 135ns ± 1% 136ns ± 1% +0.44% (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16 139ns ± 0% 139ns ± 0% +0.48% (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16 34.8ns ± 1% 35.5ns ± 2% +2.12% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16 37.6ns ± 0% 38.6ns ± 1% +2.53% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16 215ns ± 1% 216ns ± 0% ~ (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16 220ns ± 1% 220ns ± 0% ~ (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16 41.6ns ± 1% 42.6ns ± 2% +2.42% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16 44.6ns ± 1% 45.6ns ± 1% +2.28% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16 2.16µs ± 0% 2.06µs ± 1% -4.27% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16 2.15µs ± 1% 2.07µs ± 0% -3.71% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16 94.1ns ± 1% 95.9ns ± 2% +1.94% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16 97.5ns ± 1% 98.2ns ± 1% +0.69% (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16 2.81µs ± 1% 2.66µs ± 1% -5.29% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16 2.82µs ± 1% 2.67µs ± 0% -5.47% (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16 689ns ± 4% 652ns ± 5% -5.32% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16 694ns ± 2% 657ns ± 1% -5.28% (p=0.000 n=10+8)
Looking at mergingIter, the Next regression seems tolerable, and SeekGE
is better.
name old time/op new time/op delta
MergingIterSeekGE/restart=16/count=1-16 1.25µs ± 3% 1.15µs ± 1% -8.51% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16 2.49µs ± 2% 2.28µs ± 2% -8.39% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16 3.82µs ± 3% 3.57µs ± 1% -6.54% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16 5.31µs ± 2% 4.86µs ± 2% -8.39% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16 6.88µs ± 1% 6.36µs ± 2% -7.49% (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16 46.0ns ± 1% 46.6ns ± 1% +1.13% (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16 72.8ns ± 1% 73.0ns ± 0% ~ (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16 93.5ns ± 0% 93.1ns ± 1% ~ (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16 104ns ± 0% 104ns ± 1% ~ (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16 121ns ± 1% 121ns ± 1% -0.52% (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16 66.6ns ± 1% 67.8ns ± 1% +1.81% (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16 93.2ns ± 0% 94.4ns ± 1% +1.24% (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16 114ns ± 0% 114ns ± 1% +0.36% (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16 122ns ± 1% 123ns ± 0% +0.41% (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16 138ns ± 1% 138ns ± 0% +0.52% (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16 572ns ± 1% 572ns ± 0% ~ (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.85µs ± 1% 1.76µs ± 1% -4.85% (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 443ns ± 0% 444ns ± 1% ~ (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.86µs ± 1% 1.77µs ± 1% -4.63% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 486ns ± 1% 482ns ± 1% -0.80% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.93µs ± 1% 1.83µs ± 1% -4.95% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 570ns ± 0% 567ns ± 2% -0.47% (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 2.12µs ± 0% 2.03µs ± 1% -4.38% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.43µs ± 1% 1.39µs ± 1% -2.57% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 2.28µs ± 1% 2.18µs ± 0% -4.54% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.59µs ± 1% 1.53µs ± 1% -3.71% (p=0.000 n=10+9)
Finally, a read benchmark where all except the first key is obsolete
shows improvement.
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10 36 32300029 ns/op 2 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10 36 32418979 ns/op 3 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10 82 13357163 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10 90 13256770 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10 36 32396367 ns/op 2 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10 26086 46095 ns/op 0 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10 88 13226711 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10 39171 30618 ns/op 0 B/op 0 allocs/op
Informs cockroachdb#2465
Force-pushed from 152d19a to a4cfbb1
flake is #2355
jbowens left a comment
Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions
sstable/block.go line 708 at r5 (raw file):
Previously, sumeerbhola wrote…
Good idea. Definitely worth an issue.
for the record, filed #2592
sstable/block.go line 949 at r5 (raw file):
Previously, sumeerbhola wrote…
I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.
i meant that if/when we do store these values out-of-band, the likelihood of being able to skip whole blocks is reduced even when there are many hidden keys.