
sstable,db: introduce a sstable-internal ObsoleteBit in the key kind#2559

Merged
sumeerbhola merged 1 commit into cockroachdb:master from sumeerbhola:local_deleted
Jun 5, 2023

Conversation

@sumeerbhola
Collaborator

This bit marks keys that are obsolete because they are not the newest seqnum for a user key (in that sstable), or they are masked by a RANGEDEL.

Setting the obsolete bit on point keys is advanced usage, so we support two modes, both of which must be truthful when setting the obsolete bit, but which vary in when they leave it unset.

  • Non-strict: In this mode, the bit does not need to be set for keys that are obsolete. Additionally, any sstable containing MERGE keys can only use this mode. An iterator over such an sstable, when configured to hideObsoletePoints, can expose multiple internal keys per user key, and can expose keys that are deleted by rangedels in the same sstable. This is the mode that non-advanced users should use. Pebble without disaggregated storage will also use this mode and will best-effort set the obsolete bit, to optimize iteration when snapshots have retained many obsolete keys.

  • Strict: In this mode, every obsolete key must have the obsolete bit set, and no MERGE keys are permitted. An iterator over such an sstable, when configured to hideObsoletePoints, satisfies two properties:

    • S1: it will expose at most one internal key per user key, which is the most recent one.
    • S2: it will never expose keys that are deleted by rangedels in the same sstable.

    This is the mode for two use cases in disaggregated storage (which will exclude parts of the key space that have MERGEs), for levels that contain sstables that can become foreign sstables:

    • Pebble compaction output to these levels that can become foreign sstables.
    • CockroachDB ingest operations that can ingest into the levels that can become foreign sstables. Note, these are not sstables corresponding to copied data for CockroachDB range snapshots. This case occurs for operations like index backfills: these trivially satisfy the strictness criteria since they only write one key per user key.

The strictness of the sstable is written to the Properties block.

The Writer implementation discovers keys that are obsolete because they are the same userkey as the previous key. This can be cheaply done since we already do user key comparisons in the Writer. For keys obsoleted by RANGEDELs, the Writer relies on the caller.

On the read path, the obsolete bit is removed by the blockIter. Since everything reading an sstable uses a blockIter, this prevents any leakage of this bit. Some effort was made to reduce the regression on the iteration path, but TableIterNext has a +5.84% regression. Some of the slowdown is clawed back by improvements to Seek (e.g. SeekGE is now faster).

old is master:

name                                                                              old time/op    new time/op    delta
BlockIterSeekGE/restart=16-16                                                        474ns ± 1%     450ns ± 1%  -5.16%  (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16                                                        520ns ± 0%     526ns ± 0%  +1.20%  (p=0.000 n=10+10)
BlockIterNext/restart=16-16                                                         19.3ns ± 1%    21.0ns ± 0%  +8.76%  (p=0.000 n=10+10)
BlockIterPrev/restart=16-16                                                         38.7ns ± 1%    39.9ns ± 0%  +3.20%  (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16                                    1.65µs ± 1%    1.61µs ± 3%  -2.24%  (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16                                      1.67µs ± 3%    1.58µs ± 3%  -5.11%  (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16                                    1.75µs ± 3%    1.68µs ± 2%  -4.14%  (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16                                      1.74µs ± 3%    1.69µs ± 3%  -2.54%  (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16                                      23.9ns ± 1%    25.3ns ± 0%  +5.84%  (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16                                        23.9ns ± 1%    25.3ns ± 0%  +5.78%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16                                      45.2ns ± 1%    46.2ns ± 1%  +2.09%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16                                        45.3ns ± 0%    46.3ns ± 0%  +2.23%  (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16     51.7ns ± 1%    55.2ns ± 4%  +6.82%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16      54.9ns ± 1%    56.4ns ± 3%  +2.73%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16    35.0ns ± 1%    34.8ns ± 1%  -0.56%  (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16     37.8ns ± 0%    38.0ns ± 1%  +0.55%  (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16     41.5ns ± 2%    42.4ns ± 1%  +2.18%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16      94.7ns ± 4%    97.0ns ± 8%    ~     (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16    35.4ns ± 2%    36.5ns ± 1%  +2.97%  (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16     60.1ns ± 1%    57.8ns ± 0%  -3.84%  (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete shows improvement.

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op

Informs #2465

@sumeerbhola sumeerbhola requested review from a team, itsbilal and jbowens May 23, 2023 23:06
@cockroach-teamcity
Member

This change is Reviewable

@sumeerbhola sumeerbhola marked this pull request as draft May 23, 2023 23:08
Contributor

@jbowens jbowens left a comment


flushing some comments

Reviewed 2 of 37 files at r1.
Reviewable status: 2 of 37 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


-- commits line 22 at r1:
Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I think if we wanted to support them, we could add a PREMERGE key kind that holds the pre-merged operands and write both the PREMERGE (live) key and the individual MERGE (obsolete bit set) operands (including for the most recent operand). An sstable iterator would be required to return either only the contained PREMERGEs (if the iterator's seqnum is higher than the sstable's LargestSeqNum) or only the contained MERGEs (if not). PREMERGE would otherwise behave just like MERGE (or the sstable iterator could transparently edit the kind to MERGE before surfacing it).

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.


-- commits line 97 at r1:
nice


sstable/block.go line 893 at r1 (raw file):

			} else {
				// k >= search key, so prune everything after index (since index
				// satisfies the property we are looking for).

thanks for adding these comments


sstable/block.go line 1078 at r1 (raw file):

		trailer := binary.LittleEndian.Uint64(i.key[n:])
		hiddenPoint = i.hideObsoletePoints &&
			(trailer&trailerObsoleteBit != 0)

nit: curious about the hiddenPoint variable and its scope; could this be

if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
    goto start
}

sstable/block.go line 1365 at r1 (raw file):

		if prefixChanged || i.cmp(i.ikey.UserKey, succKey) >= 0 {
			// Prefix has changed.
			if hiddenPoint {

is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?


sstable/block.go line 1404 at r1 (raw file):

			trailer := binary.LittleEndian.Uint64(i.key[n:])
			hiddenPoint = i.hideObsoletePoints &&
				(trailer&trailerObsoleteBit != 0)

nit similar question about hiddenPoint and its scope

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

I'll start working on cleaning up the various TODOs

Reviewable status: 2 of 37 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)


-- commits line 22 at r1:
Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.

Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state machine operations (read-modify-write, like counters) and one could optimize application for some subset of those by avoiding the read altogether (e.g. counter), and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint) and avoiding the read at the "local" Pebble layer does not factor into the costs.

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.

I don't think we need that to be future proof. We can easily do something like the following sequence:

  • Support strict and only write strict ssts for the disagg storage key space.
  • New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.

I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.


sstable/block.go line 1078 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: curious about the hiddenPoint variable and its scope; could this be

if i.hideObsoletePoints && (trailer&trailerObsoleteBit != 0) {
    goto start
}

Good observation. It can be in the scope of the if-block. I played around with (a) your suggestion and (b) goto in the same place (but hiddenPoint declared in the scope of the if-block). (a) is slightly slower for BenchmarkBlockIter* -- maybe it disrupts the instruction pipeline due to misprediction since the goto path never gets taken. (b) is the same as the current code. So I've changed to (b).


sstable/block.go line 1365 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is this possible? if the prefix changed, it must necessarily be the first of its user key and thus not obsolete?

It can be a hidden point due to rangedels.


sstable/block.go line 1404 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit similar question about hiddenPoint and its scope

Done

@sumeerbhola sumeerbhola force-pushed the local_deleted branch 5 times, most recently from 301da5b to 357cda7 Compare June 1, 2023 19:52
@sumeerbhola sumeerbhola marked this pull request as ready for review June 1, 2023 19:52
@sumeerbhola
Collaborator Author

I've marked this as ready for review since much of the cleanup is done. My next steps:

  • Add more unit tests.
  • Investigate the correctness TODO in compaction_iter.go.

I'll ping when these steps are done.

@sumeerbhola sumeerbhola force-pushed the local_deleted branch 5 times, most recently from f632070 to 38f33c8 Compare June 2, 2023 16:16
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Add more unit tests.

Tests are ready.

Investigate the correctness TODO in compaction_iter.go.

Done

This is ready for a full review.

Reviewed 13 of 37 files at r1, 1 of 1 files at r2.
Reviewable status: 4 of 51 files reviewed, 5 unresolved discussions (waiting on @itsbilal and @jbowens)

Contributor

@jbowens jbowens left a comment


:lgtm_strong:

Reviewed 16 of 37 files at r1, 17 of 21 files at r3, 16 of 16 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


-- commits line 22 at r1:

Previously, sumeerbhola wrote…

Good idea for PREMERGE. I'll add it to the issue, so that we don't forget.

Are there any potential new uses of MERGE keys in Cockroach that we might be sad about closing the door on?

I don't think anything new is coming up in this regard. We've rarely discussed supporting more sophisticated state machine operations (read-modify-write, like counters) and one could optimize application for some subset of those by avoiding the read altogether (e.g. counter), and using a merge. But the benefit of such read-modify-write operations is mostly at higher layers (reduced distributed contention footprint) and avoiding the read at the "local" Pebble layer does not factor into the costs.

I'm wondering if implementing PREMERGE and leaving the door open on additional MERGE usage might be preferable to implementing the 'strict' mode distinction.

I don't think we need that to be future proof. We can easily do something like the following sequence:

  • Support strict and only write strict ssts for the disagg storage key space.
  • New feature comes up that needs MERGE in the disagg storage key space: Add support for PREMERGE (or some other solution) in a new sstable format. New ssts for the disagg storage key space start getting written in this new format and can handle MERGE.

I'm wary of complicating with a PREMERGE when the only use case is our current timeseries.

Sounds good


compaction_iter.go line 223 at r5 (raw file):

	// sameStripeNonSkippable occurs due to RANGEDELs that sort before
	// SET/MERGE/DEL with the same seqnum, so the RANGEDEL does not necessarily
	// delete the subsequent SET/MERGE/DEL keys.

I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable.

IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.


internal/base/internal.go line 129 at r5 (raw file):

// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax
const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1)

nice


sstable/block.go line 1365 at r1 (raw file):

Previously, sumeerbhola wrote…

It can be a hidden point due to rangedels.

ah, right!


sstable/block.go line 94 at r5 (raw file):

	n := len(w.curKey) - base.InternalTrailerLen
	if n < 0 {
		panic("corrupt key in blockWriter buffer")

nit: we could update this to use errors.AssertionFailedf


sstable/block.go line 633 at r5 (raw file):

	if n := len(firstKey) - 8; n >= 0 {
		// NB: we do not track whether the firstKey is obsolete since the trailer
		// of the firstKey is not used.

should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?


sstable/block.go line 708 at r5 (raw file):

			var k InternalKey
			if n := len(s) - 8; n >= 0 {
				k.Trailer = binary.LittleEndian.Uint64(s[n:])

this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.

index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).

if you think it's worth an issue, i'll file one.


sstable/block.go line 949 at r5 (raw file):

		// was hidden. Our current assumption is that if there are large numbers
		// of hidden keys we will be able to skip whole blocks (using block
		// property filters) so we don't bother optimizing.

one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.


sstable/suffix_rewriter.go line 69 at r5 (raw file):

//     sstable to be rewritten. Validating the former condition is non-trivial
//     when rewriting blocks in parallel, so we could encode the pre-existing
//     condition in a NumObsoletePoints property.

nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.


sstable/suffix_rewriter.go line 455 at r5 (raw file):

// suffix. This is significantly slower than the parallelized rewriter, and does
// more work to rederive filters, props, etc, however re-doing that work makes
// it less restrictive -- props no longer need to ??

yeah, idk either

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions (waiting on @itsbilal and @jbowens)


-- commits line 22 at r1:

Previously, jbowens (Jackson Owens) wrote…

Sounds good

for the review record, the idea is recorded in #2465 (comment)


compaction_iter.go line 223 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm hoping to remove this wart by resurrecting this oldie: #1659. When rangedels are interleaved at the infinite sequence number, they can never appear within a stripe, only before it. I think that will allow us to remove sameStripeNonSkippable.

IIRC, there was some awkward bit of #1659 that I wanted to address before merging, and it died on the vine. I doubt it would be much work to resurrect, though.

Ack. I didn't realize that PR existed.


sstable/block.go line 94 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: we could update this to use errors.AssertionFailedf

Done


sstable/block.go line 633 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

should this function even bother decoding the trailer? we could update i.firstKey to be just a []byte user key?

Done


sstable/block.go line 708 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this is an aside and might be an insignificant micro-optimization that's not worth the code duplication, but your optimization here has me wondering what benefit there might be to using a specialized blockIter for index blocks.

index blocks should typically be much more dense, since their values are just small block handles and the keys may be shortened by Comparer.Separator. i think this means seeking within them should be a relatively large portion of a sstable seek's cpu cost. we also know their keys' trailers never matter, so there's no need to decode them even for the returned KV. we also know their values are always inlined, so we could avoid any of the overhead from LazyValue. now that we have obsolete key kinds, we can also avoid any overhead from checking obsolete bits / masking. i think we also never call SeekLT/NextPrefix on them, so it doesn't quite double the code surface area. it's also a step towards optimizations that change the index block encoding (like omitting trailers from the encoded block).

if you think it's worth an issue, i'll file one.

Good idea. Definitely worth an issue.


sstable/block.go line 949 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

one counter consideration here is that obsolete keys' values will now be stored out-of-band, increasing the density of keys within a block, requiring more obsolete keys to fill a whole block with obsolete keys. i'm still not sure it's worthwhile to address this optimization, but thought i'd flag it.

I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.


sstable/suffix_rewriter.go line 69 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: we already have the SnapshotPinnedKeys property which I think captures the same thing, so I think we'd be updating the external sst writer to calculate it and encode it.

Good point. Updated the comment.


sstable/suffix_rewriter.go line 455 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

yeah, idk either

I dropped the sentence and added a todo earlier in the file

// TODO(sumeer): document limitations, if any, due to this limited
// re-computation of properties (is there any loss of fidelity?).

IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE
is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete
shows a large improvement.

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op

Informs cockroachdb#2465
@sumeerbhola (Collaborator, Author) commented:

flake is #2355

@sumeerbhola sumeerbhola merged commit ec5ff92 into cockroachdb:master Jun 5, 2023
@jbowens (Contributor) left a comment:

Reviewable status: 48 of 51 files reviewed, 11 unresolved discussions


sstable/block.go line 708 at r5 (raw file):

Previously, sumeerbhola wrote…

Good idea. Definitely worth an issue.

For the record, filed #2592.


sstable/block.go line 949 at r5 (raw file):

Previously, sumeerbhola wrote…

I didn't quite follow this. We are not currently storing the values out-of-band in value blocks. And the caching mentioned here is the one for SeekLT which is based on restart interval.

I meant that if/when we do store these values out-of-band, the likelihood of being able to skip whole blocks is reduced even when there are many hidden keys.
