Add binary doc value compression with variable doc count blocks #137139
Conversation
Reintroduce the LZ4 binary doc values compression originally added to Lucene in LUCENE-9211, modified so that it works in ES819TSDBDocValuesFormat.
boolean success = false;
try {
    tempOutput = dir.createTempOutput(data.getName(), suffix, context);
    CodecUtil.writeHeader(
Does it make sense to add the header/footer and then check checksum, given that we are immediately using and deleting the temp file?
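For reference, a minimal sketch of the temp-output round trip being questioned (not the PR's actual code; the codec name, suffix, and method name are placeholders): the header and footer are written to a file that is verified and deleted almost immediately.

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Hypothetical helper illustrating the pattern: header/footer plus a checksum check
// on a temp file that only lives for the duration of this method.
static void tempOutputRoundTrip(Directory dir, String prefix, IOContext context) throws IOException {
    IndexOutput tempOutput = dir.createTempOutput(prefix, "tmp", context);
    String tempName = tempOutput.getName();
    try (tempOutput) {
        CodecUtil.writeHeader(tempOutput, "TempCodec", 0); // header on a short-lived file
        // ... write the temporary data ...
        CodecUtil.writeFooter(tempOutput);                 // footer carries the checksum
    }
    try (IndexInput in = dir.openInput(tempName, context)) {
        CodecUtil.checksumEntireFile(in);                  // the verification step the question refers to
        // ... copy the temp data into the real data output ...
    }
    dir.deleteFile(tempName);                              // temp file is discarded right away
}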
martijnvg left a comment
Good stuff Parker! I did a first review round.
Additionally, I think we should get real bwc test coverage. We can get that by adding a bwc java integration test for the wildcard field type, similar to MatchOnlyTextRollingUpgradeIT or TextRollingUpgradeIT.
Kubik42 left a comment
Nice! I've left more questions than comments:
    success = true;
} finally {
    if (success == false) {
        IOUtils.closeWhileHandlingException(this); // self-close because constructor caller can't
this should be tested
}

public void doAddCompressedBinary(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
    try (CompressedBinaryBlockWriter blockWriter = new CompressedBinaryBlockWriter()) {
[nit] could use some comments here, explaining what each chunk of code does
dnhatn left a comment
I've left some comments, but this looks great. Thanks Parker!
    numDocsInCurrentBlock = uncompressedBlockLength = 0;
}

void compressOffsets(DataOutput output, int numDocsInCurrentBlock) throws IOException {
Should we encode the lengths using GroupVIntUtil#writeGroupVInts instead? I'm not sure TSDBDocValuesEncoder is suitable for encoding these offsets. Also, always padding to 128 offsets may be wasteful.
Good point; especially after limiting the number of docs per block to 1024, the padding could be a concern. Sounds good, I'll give this a try 👍
Hmm, so I'm seeing a slow-down with readGroupVInts on some benchmark queries. Mostly small decreases that could be noise, but some are in the 25-40% range, which is concerning. I'd have thought GroupVIntUtil would be quite fast. Is there possibly something I'm missing in the decompression code that could speed it up? I'm currently benchmarking with uncompressed offsets to get a baseline for offset (de)compression.
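For illustration, a minimal sketch of the length-encoding idea being discussed (this is not the PR's compressOffsets; the method names are made up): block-local offsets are turned into per-value lengths and only numDocsInCurrentBlock values are written, with no padding to 128 entries. A group-varint variant would replace the writeVInt/readVInt loops with GroupVIntUtil#writeGroupVInts and its matching read call.

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

// Write only the lengths actually present in the block; offsets are cumulative,
// so consecutive deltas are the individual value lengths.
static void writeBlockLengths(DataOutput output, long[] offsets, int numDocsInCurrentBlock) throws IOException {
    long previous = 0;
    for (int i = 0; i < numDocsInCurrentBlock; i++) {
        output.writeVInt(Math.toIntExact(offsets[i] - previous));
        previous = offsets[i];
    }
}

// Rebuild the cumulative offsets on read; a group-varint decode would replace this loop.
static long[] readBlockOffsets(DataInput input, int numDocsInBlock) throws IOException {
    long[] offsets = new long[numDocsInBlock];
    long sum = 0;
    for (int i = 0; i < numDocsInBlock; i++) {
        sum += input.readVInt();
        offsets[i] = sum;
    }
    return offsets;
}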
void compress(byte[] data, int uncompressedLength, DataOutput output) throws IOException {
    ByteBuffer inputBuffer = ByteBuffer.wrap(data, 0, uncompressedLength);
    ByteBuffersDataInput input = new ByteBuffersDataInput(List.of(inputBuffer));
    compressor.compress(input, output);
Should we use Zstd from NativeAccess directly to avoid copying data to an intermediate buffer before the native buffer?
My only concern is that this currently uses Lucene's Compressor/CompressionMode, which makes it easy to add other compressors. On the other hand, as we previously spoke about, it might make sense to use LZ4 to partially decompress blocks. If that is the case, we may not want to use the Compressor interface ... though I'm actually not sure either way.
Anyway, I split a hacky version of this off here, and will benchmark it to see if it's worth doing.
I ran some benchmarks on the above hacky version and got some weird results. Some of the queries got a nice throughput increase. The weird part is that the Store Size increased by an amount that was not reflected in the output of disk_usage, so there must be a bug in my version that is causing this.
To keep this PR small(er), what do you think about switching to NativeAccess directly in a separate PR?
    }
}

void compress(byte[] data, int uncompressedLength, DataOutput output) throws IOException {
If we use Zstd directly, should we also handle cases where compression does not reduce storage and store the raw bytes instead?
I like the idea of not compressing if it doesn't help. This would still apply with non-direct Zstd, right? I guess for non-direct Zstd we'd need a separate output buffer to check the length before writing to the output.
I discussed this with Martijn and he suggested adding a signal byte now, indicating whether or not the data is compressed. It would always be set to true for now, but would support false once we add direct Zstd and enable this optimization. What do you think?
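A minimal sketch of that signal-byte idea for the non-direct path (assumptions: a scratch ByteBuffersDataOutput and a made-up method name; the PR currently always compresses, so the byte would always be 1 for now): compress into a scratch buffer, then keep whichever representation is smaller and record a leading byte so the reader knows which one it got.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;
import org.apache.lucene.codecs.compressing.Compressor;
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.ByteBuffersDataOutput;
import org.apache.lucene.store.DataOutput;

// Signal byte: 1 = block is compressed, 0 = block stored as raw bytes.
static void compressOrStoreRaw(Compressor compressor, byte[] data, int uncompressedLength, DataOutput output)
    throws IOException {
    ByteBuffersDataOutput scratch = ByteBuffersDataOutput.newResettableInstance();
    ByteBuffersDataInput input = new ByteBuffersDataInput(List.of(ByteBuffer.wrap(data, 0, uncompressedLength)));
    compressor.compress(input, scratch);
    if (scratch.size() < uncompressedLength) {
        output.writeByte((byte) 1);                        // compressed block
        scratch.copyTo(output);
    } else {
        output.writeByte((byte) 0);                        // compression did not help, store raw
        output.writeBytes(data, 0, uncompressedLength);
    }
}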
dnhatn left a comment
@parkertimmins Thanks for the extra experiment. As discussed offline, we may need a follow-up after this PR, but the current state looks great.
martijnvg left a comment
Thanks Parker! LGTM.
A few tests are still failing with
All logsdb tests are passing, including rolling upgrade bwc tests. Since that is the case, I'll go ahead with the merge.
Binary doc value compression was added behind a feature flag in #137139. This PR removes the feature flag to enable the feature.
Add compression for binary doc values using Zstd and blocks with a variable number of values.
Block-wise LZ4 compression was previously added to Lucene in LUCENE-9211. It was subsequently removed in LUCENE-9378 due to query performance issues.
We investigated adding the original Lucene implementation to ES in #112416 and #105301. That approach stores a constant number of values per block (specifically 32 values). This is nice because it makes it very easy to map a given value index (e.g. docId for dense values) to the block containing it with blockId = docId / 32. Unfortunately, if values are very large we cannot reduce the number of values per block, and (de)compressing a block could cause an OOM. Also, since this is a concern, we have to keep the number of values per block lower than ideal.
This PR instead stores a variable number of documents per block. It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold. Like the previous version, it stores an array of addresses, one for the start of each block. Additionally, it stores a parallel array with the value index at the start of each block. When looking up a given value index, if it is not in the current block, we binary search the array of block start value indexes to find the blockId containing the value, then look up the address of that block.
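A rough sketch of the lookup described above (names are illustrative, not the PR's actual producer code): each block has an address and the index of its first value, kept in parallel arrays; a value index outside the current block is located by binary searching the first-value indexes, then seeking to that block's address.

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.store.IndexInput;

// blockFirstValueIndex[i] = index of the first value stored in block i (parallel to blockAddress).
static int findBlock(long[] blockFirstValueIndex, long valueIndex) {
    int idx = Arrays.binarySearch(blockFirstValueIndex, valueIndex);
    // Exact hit: valueIndex starts a block. Otherwise take the block that begins before it.
    return idx >= 0 ? idx : -idx - 2;
}

static void seekToBlock(IndexInput data, long[] blockAddress, long[] blockFirstValueIndex, long valueIndex)
    throws IOException {
    int blockId = findBlock(blockFirstValueIndex, valueIndex);
    data.seek(blockAddress[blockId]);
    // ... decompress the block here and return the value at (valueIndex - blockFirstValueIndex[blockId]) ...
}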