
sstable,db: introduce a sstable-internal ObsoleteBit in the key kind#2559

Merged
sumeerbhola merged 1 commit into cockroachdb:master from sumeerbhola:local_deleted
Jun 5, 2023

Conversation

@sumeerbhola
Collaborator

This bit marks keys that are obsolete because they are not the newest seqnum for a user key (in that sstable), or they are masked by a RANGEDEL.

Setting the obsolete bit on point keys is advanced usage, so we support two modes, both of which must be truthful when setting the obsolete bit, but which vary in when they leave it unset.

  • Non-strict: In this mode, the bit does not need to be set for keys that are obsolete. Additionally, any sstable containing MERGE keys can only use this mode. An iterator over such an sstable, when configured to hideObsoletePoints, can expose multiple internal keys per user key, and can expose keys that are deleted by rangedels in the same sstable. This is the mode that non-advanced users should use. Pebble without disaggregated storage will also use this mode and will best-effort set the obsolete bit, to optimize iteration when snapshots have retained many obsolete keys.

  • Strict: In this mode, every obsolete key must have the obsolete bit set, and no MERGE keys are permitted. An iterator over such an sstable, when configured to hideObsoletePoints, satisfies two properties:

    • S1: it will expose at most one internal key per user key, which is the most recent one.
    • S2: it will never expose keys that are deleted by rangedels in the same sstable.

    This is the mode for two use cases in disaggregated storage (which will exclude parts of the key space that have MERGEs), for levels that contain sstables that can become foreign sstables:

    • Pebble compaction output to these levels that can become foreign sstables.
    • CockroachDB ingest operations that can ingest into the levels that can become foreign sstables. Note, these are not sstables corresponding to copied data for CockroachDB range snapshots. This case occurs for operations like index backfills: these trivially satisfy the strictness criteria since they only write one key per user key.

The strictness of the sstable is written to the Properties block.

The Writer implementation discovers keys that are obsolete because they are the same userkey as the previous key. This can be cheaply done since we already do user key comparisons in the Writer. For keys obsoleted by RANGEDELs, the Writer relies on the caller.

On the read path, the obsolete bit is removed by the blockIter. Since everything reading an sstable uses a blockIter, this prevents any leakage of this bit. Some effort was made to reduce the regression on the iteration path, but TableIterNext has a +5.84% regression. Some of the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster).

old is master:

name                                                                              old time/op    new time/op    delta
BlockIterSeekGE/restart=16-16                                                        474ns ± 1%     450ns ± 1%  -5.16%  (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16                                                        520ns ± 0%     526ns ± 0%  +1.20%  (p=0.000 n=10+10)
BlockIterNext/restart=16-16                                                         19.3ns ± 1%    21.0ns ± 0%  +8.76%  (p=0.000 n=10+10)
BlockIterPrev/restart=16-16                                                         38.7ns ± 1%    39.9ns ± 0%  +3.20%  (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16                                    1.65µs ± 1%    1.61µs ± 3%  -2.24%  (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16                                      1.67µs ± 3%    1.58µs ± 3%  -5.11%  (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16                                    1.75µs ± 3%    1.68µs ± 2%  -4.14%  (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16                                      1.74µs ± 3%    1.69µs ± 3%  -2.54%  (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16                                      23.9ns ± 1%    25.3ns ± 0%  +5.84%  (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16                                        23.9ns ± 1%    25.3ns ± 0%  +5.78%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16                                      45.2ns ± 1%    46.2ns ± 1%  +2.09%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16                                        45.3ns ± 0%    46.3ns ± 0%  +2.23%  (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16     51.7ns ± 1%    55.2ns ± 4%  +6.82%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16      54.9ns ± 1%    56.4ns ± 3%  +2.73%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16    35.0ns ± 1%    34.8ns ± 1%  -0.56%  (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16     37.8ns ± 0%    38.0ns ± 1%  +0.55%  (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16     41.5ns ± 2%    42.4ns ± 1%  +2.18%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16      94.7ns ± 4%    97.0ns ± 8%    ~     (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16    35.4ns ± 2%    36.5ns ± 1%  +2.97%  (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16     60.1ns ± 1%    57.8ns ± 0%  -3.84%  (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete shows improvement.

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op

Informs #2465

@sumeerbhola sumeerbhola requested review from a team, itsbilal and jbowens May 23, 2023 23:06
@cockroach-teamcity
Member

This change is Reviewable

@sumeerbhola sumeerbhola marked this pull request as draft May 23, 2023 23:08
Contributor

@jbowens jbowens left a comment


flushing some comments

Reviewed 2 of 37 files at r1.
Reviewable status: 2 of 37 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


-- commits line 22 at r1:
Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I think if we wanted to support them, we could add a PREMERGE key kind that holds the pre-merged operands and write both the PREMERGE (live) key and the individual MERGE (obsolete bit set) operands (including for the most recent operand). An sstable iterator would be required to return either only the contained PREMERGEs (if the iterator's seqnum is higher than the sstable's LargestSeqNum) or only the contained MERGEs (if not). PREMERGE would otherwise behave just like MERGE (or the sstable iterator could transparently edit the kind to MERGE before surfacing it).

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.


-- commits line 97 at r1:
nice


sstable/block.go line 893 at r1 (raw file):

			} else {
				// k >= search key, so prune everything after index (since index
				// satisfies the property we are looking for).

thanks for adding these comments


sstable/block.go line 1078 at r1 (raw file):

		trailer := binary.LittleEndian.Uint64(i.key[n:])
		hiddenPoint = i.hideObsoletePoints &&
			(trailer&trailerObsoleteBit != 0)

nit: curious about the hiddenPoint variable and its scope; could this be

if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
    goto start
}

sstable/block.go line 1365 at r1 (raw file):

		if prefixChanged || i.cmp(i.ikey.UserKey, succKey) >= 0 {
			// Prefix has changed.
			if hiddenPoint {

is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?


sstable/block.go line 1404 at r1 (raw file):

			trailer := binary.LittleEndian.Uint64(i.key[n:])
			hiddenPoint = i.hideObsoletePoints &&
				(trailer&trailerObsoleteBit != 0)

nit similar question about hiddenPoint and its scope

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

I'll start working on cleaning up the various TODOs

Reviewable status: 2 of 37 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)


-- commits line 22 at r1:
Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.

Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state machine operations (read-modify-write, like counters) and one could optimize application for some subset of those by avoiding the read altogether (e.g. counter), and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint) and avoiding the read at the "local" Pebble layer does not factor into the costs.

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.

I don't think we need that to be future proof. We can easily do something like the following sequence:

  • Support strict and only write strict ssts for the disagg storage key space.
  • New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.

I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.


sstable/block.go line 1078 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: curious about the hiddenPoint variable and its scope; could this be

if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
    goto start
}

Good observation. It can be in the scope of the if-block. I played around with (a) your suggestion and (b) goto in the same place (but hiddenPoint declared in the scope of the if-block). (a) is slightly slower for BenchmarkBlockIter* -- maybe it disrupts the instruction pipeline due to misprediction since the goto path never gets taken. (b) is the same as the current code. So I've changed to (b).


sstable/block.go line 1365 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?

It can be a hidden point due to rangedels.


sstable/block.go line 1404 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit similar question about hiddenPoint and its scope

Done

@sumeerbhola sumeerbhola force-pushed the local_deleted branch 5 times, most recently from 301da5b to 357cda7 Compare June 1, 2023 19:52
@sumeerbhola sumeerbhola marked this pull request as ready for review June 1, 2023 19:52
@sumeerbhola
Collaborator Author

I've marked this as ready for review since much of the cleanup is done. My next steps:

  • Add more unit tests.
  • Investigate the correctness TODO in compaction_iter.go.

I'll ping when these steps are done.

@sumeerbhola sumeerbhola force-pushed the local_deleted branch 5 times, most recently from f632070 to 38f33c8 Compare June 2, 2023 16:16
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Add more unit tests.

Tests are ready.

Investigate the correctness TODO in compaction_iter.go.

Done

This is ready for a full review.

Reviewed 13 of 37 files at r1, 1 of 1 files at r2.
Reviewable status: 4 of 51 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)

Contributor

@jbowens jbowens left a comment


:lgtm_strong:

Reviewed 16 of 37 files at r1, 17 of 21 files at r3, 16 of 16 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


-- commits line 22 at r1:

Previously, sumeerbhola wrote…

Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.

Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state machine operations (read-modify-write, like counters) and one could optimize application for some subset of those by avoiding the read altogether (e.g. counter), and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint) and avoiding the read at the "local" Pebble layer does not factor into the costs.

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.

I don't think we need that to be future proof. We can easily do something like the following sequence:

  • Support strict and only write strict ssts for the disagg storage key space.
  • New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.

I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.

Sounds good


compaction_iter.go line 223 at r5 (raw file):

	// sameStripeNonSkippable occurs due to RANGEDELs that sort before
	// SET/MERGE/DEL with the same seqnum, so the RANGEDEL does not necessarily
	// delete the subsequent SET/MERGE/DEL keys.

I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable.

IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.


internal/base/internal.go line 129 at r5 (raw file):

// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax
const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1)

nice


sstable/block.go line 1365 at r1 (raw file):

Previously, sumeerbhola wrote…

It can be a hidden point due to rangedels.

ah, right!


sstable/block.go line 94 at r5 (raw file):

	n := len(w.curKey) - base.InternalTrailerLen
	if n < 0 {
		panic("corrupt key in blockWriter buffer")

nit: we could update this to use errors.AssertionFailedf


sstable/block.go line 633 at r5 (raw file):

	if n := len(firstKey) - 8; n >= 0 {
		// NB: we do not track whether the firstKey is obsolete since the trailer
		// of the firstKey is not used.

should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?


sstable/block.go line 708 at r5 (raw file):

			var k InternalKey
			if n := len(s) - 8; n >= 0 {
				k.Trailer = binary.LittleEndian.Uint64(s[n:])

this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.

index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).

if you think it's worth an issue, i'll file one.


sstable/block.go line 949 at r5 (raw file):

		// was hidden. Our current assumption is that if there are large numbers
		// of hidden keys we will be able to skip whole blocks (using block
		// property filters) so we don't bother optimizing.

one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.


sstable/suffix_rewriter.go line 69 at r5 (raw file):

//     sstable to be rewritten. Validating the former condition is non-trivial
//     when rewriting blocks in parallel, so we could encode the pre-existing
//     condition in a NumObsoletePoints property.

nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.


sstable/suffix_rewriter.go line 455 at r5 (raw file):

// suffix. This is significantly slower than the parallelized rewriter, and does
// more work to rederive filters, props, etc, however re-doing that work makes
// it less restrictive -- props no longer need to ??

yeah, idk either

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions (waiting on @itsbilal and @jbowens)


-- commits line 22 at r1:

Previously, jbowens (Jackson Owens) wrote…

Sounds good

for the review record, the idea is recorded in #2465 (comment)


compaction_iter.go line 223 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable.

IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.

Ack. I didn't realize that PR existed.


sstable/block.go line 94 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: we could update this to use errors.AssertionFailedf

Done


sstable/block.go line 633 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?

Done


sstable/block.go line 708 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.

index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).

if you think it's worth an issue, i'll file one.

Good idea. Definitely worth an issue.


sstable/block.go line 949 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.

I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.


sstable/suffix_rewriter.go line 69 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.

Good point. Updated the comment.


sstable/suffix_rewriter.go line 455 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

yeah, idk either

I dropped the sentence and added a todo earlier in the file

// TODO(sumeer): document limitations, if any, due to this limited
// re-computation of properties (is there any loss of fidelity?).

IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE
is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete
shows a large improvement.

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op

Informs cockroachdb#2465
@sumeerbhola (Collaborator, Author) commented:

flake is #2355

@sumeerbhola sumeerbhola merged commit ec5ff92 into cockroachdb:master Jun 5, 2023
@jbowens (Contributor) left a comment:

Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions


sstable/block.go line 708 at r5 (raw file):

Previously, sumeerbhola wrote…

Good idea. Definitely worth an issue.

For the record, filed #2592.


sstable/block.go line 949 at r5 (raw file):

Previously, sumeerbhola wrote…

I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.

I meant that if/when we do store these values out-of-band, the likelihood of being able to skip whole blocks is reduced even when there are many hidden keys.
