Conversation

Contributor

@parkertimmins parkertimmins commented Oct 24, 2025

Add compression for binary doc values using Zstd and blocks with a variable number of values.

Block-wise LZ4 compression for binary doc values was previously added to Lucene in LUCENE-9211. This was subsequently removed in LUCENE-9378 due to query performance issues.

We investigated adding the original Lucene implementation to ES in #112416 and #105301. That approach stores a constant number of values per block (specifically 32). This is nice because it makes it easy to map a given value index (e.g. docId for dense values) to the block containing it with blockId = docId / 32. Unfortunately, if values are very large we cannot reduce the number of values per block, and (de)compressing a block could cause an OOM. And because this is a concern, we have to keep the number of values per block lower than ideal.

This PR instead stores a variable number of documents per block. It stores a minimum of 1 document per block and stops adding values once the size of a block exceeds a threshold. Like the previous version, it stores an array of addresses for the start of each block. Additionally, it stores a parallel array with the value index at the start of each block. When looking up a given value index, if it is not in the current block, we binary search the array of block start value indexes to find the blockId containing the value, then look up the address of that block.
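As an illustration of that lookup, here is a minimal sketch; the array names are illustrative, not the PR's actual fields:

import java.util.Arrays;

// blockFirstValueIndex[i] = value index of the first value stored in block i (ascending)
// blockAddress[i]         = file offset at which block i starts
static long addressForValue(long valueIndex, long[] blockFirstValueIndex, long[] blockAddress) {
    int pos = Arrays.binarySearch(blockFirstValueIndex, valueIndex);
    // Exact hit: valueIndex is the first value of block pos. Otherwise binarySearch returns
    // -(insertionPoint) - 1, and the containing block is the one just before the insertion point.
    int blockId = pos >= 0 ? pos : -pos - 2;
    return blockAddress[blockId];
}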

boolean success = false;
try {
    tempOutput = dir.createTempOutput(data.getName(), suffix, context);
    CodecUtil.writeHeader(
Contributor Author

Does it make sense to add the header/footer and then check the checksum, given that we are immediately using and deleting the temp file?
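For context, the temp-file pattern in question usually looks roughly like the sketch below; the codec and file names are illustrative, not this PR's exact code:

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

static void roundTripTempFile(Directory dir, String prefix, IOContext context) throws IOException {
    String tempName;
    try (IndexOutput tempOutput = dir.createTempOutput(prefix, "tmp", context)) {
        tempName = tempOutput.getName();
        CodecUtil.writeHeader(tempOutput, "TempCodec", 0);
        // ... write the intermediate data ...
        CodecUtil.writeFooter(tempOutput);
    }
    try (ChecksumIndexInput tempInput = dir.openChecksumInput(tempName)) {
        CodecUtil.checkHeader(tempInput, "TempCodec", 0, 0);
        // ... consume the data ...
        CodecUtil.checkFooter(tempInput); // verifies the checksum of the whole file
    }
    dir.deleteFile(tempName); // the temp file only lives for this round trip
}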

@parkertimmins parkertimmins changed the title from "Add binary doc value compression with variably doc count blocks" to "Add binary doc value compression with variable doc count blocks" on Oct 25, 2025
Member

@martijnvg martijnvg left a comment

Good stuff Parker! I did a first review round.

Additionally, I think we should get real bwc test coverage. I think we can get that by adding a bwc Java integration test for the wildcard field type, similar to MatchOnlyTextRollingUpgradeIT or TextRollingUpgradeIT.

Contributor

@Kubik42 Kubik42 left a comment

Nice! I've left more questions than comments:

    success = true;
} finally {
    if (success == false) {
        IOUtils.closeWhileHandlingException(this); // self-close because constructor caller can't
Contributor

this should be tested

}

public void doAddCompressedBinary(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
    try (CompressedBinaryBlockWriter blockWriter = new CompressedBinaryBlockWriter()) {
Contributor

[nit] could use some comments here, explaining what each chunk of code does

Member

@dnhatn dnhatn left a comment

I've left some comments, but this looks great. Thanks Parker!

    numDocsInCurrentBlock = uncompressedBlockLength = 0;
}

void compressOffsets(DataOutput output, int numDocsInCurrentBlock) throws IOException {
Member

Should we encode the lengths using GroupVIntUtil#writeGroupVInts instead? I'm not sure TSDBDocValuesEncoder is suitable for encoding these offsets. Also, always padding to 128 offsets may be wasteful.
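For context, a rough sketch of what that could look like, assuming DataOutput#writeGroupVInts and GroupVIntUtil#readGroupVInts exist with roughly these signatures in the bundled Lucene version; this is illustrative, not this PR's code:

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.GroupVIntUtil;

// lengths[0..count) holds the per-value lengths for one block
static void writeLengths(DataOutput output, long[] lengths, int count) throws IOException {
    output.writeVInt(count);                // write the real count instead of padding to 128
    output.writeGroupVInts(lengths, count); // assumed entry point that delegates to GroupVIntUtil
}

static long[] readLengths(DataInput input) throws IOException {
    int count = input.readVInt();
    long[] lengths = new long[count];
    GroupVIntUtil.readGroupVInts(input, lengths, count); // assumed signature
    return lengths;
}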

Contributor Author

Good point, especially after limiting the number of docs per block to 1024 the padding could be a concern. Sounds good, I'll give this a try 👍

Contributor Author

Hmm, so I'm seeing a slow-down with readGroupVInts on some benchmark queries. Mostly small decreases that could be noise, but some in the 25-40% range that are concerning. I'd think that GroupVIntUtil would be quite fast. Is there possibly something I'm missing in the decompression code that could speed it up? I'm currently benchmarking with uncompressed offsets to get a baseline for offset (de)compression.

void compress(byte[] data, int uncompressedLength, DataOutput output) throws IOException {
    ByteBuffer inputBuffer = ByteBuffer.wrap(data, 0, uncompressedLength);
    ByteBuffersDataInput input = new ByteBuffersDataInput(List.of(inputBuffer));
    compressor.compress(input, output);
Member

Should we use Zstd from NativeAccess directly to avoid copying data to an intermediate buffer before the native buffer?

Contributor Author

My only concern is that currently this uses Lucene's Compressor/CompressionMode, which will make it easy to add other compressors. On the other hand, as we previously discussed, it might make sense to use LZ4 to partially decompress blocks. If that is the case, we may not want to use the Compressor interface ... though I'm actually not sure either way.

Anyway, I split a hacky version of this off here, and will benchmark it to see if it's worth doing.

Contributor Author

I ran some benchmarks on the above hacky version and got some weird results. Some of the queries got a nice throughput increase. The weird part is that the Store Size increased by an amount that was not reflected in the output of disk_usage. There must be a bug in my version that is causing this.

To keep this PR small(er), what do you think about updating to using NativeAccess directly in a separate PR?

    }
}

void compress(byte[] data, int uncompressedLength, DataOutput output) throws IOException {
Member

If we use Zstd directly, should we also handle cases where compression does not reduce storage and store the raw bytes instead?

Contributor Author

I like the idea of not compressing if it doesn't help. This would still apply with non-direct Zstd, right? I guess for non-direct Zstd we'd need a separate output buffer to check the length before sending to output.

I discussed this with Martijn and he suggested adding a signal byte now, which says whether or not the data is compressed. It would always be set to true for now, but will support false once we add direct Zstd and enable this optimization. What do you think?
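A minimal sketch of that signal byte, assuming the block has already been compressed into a separate scratch buffer so the two lengths can be compared; names and layout are illustrative, not this PR's format:

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

// One flag byte per block: 1 = payload is Zstd-compressed, 0 = raw bytes stored as-is.
static void writeBlock(byte[] raw, int rawLength, byte[] compressed, int compressedLength, DataOutput output) throws IOException {
    if (compressedLength < rawLength) {
        output.writeByte((byte) 1);
        output.writeVInt(compressedLength);
        output.writeBytes(compressed, 0, compressedLength);
    } else {
        output.writeByte((byte) 0);
        output.writeBytes(raw, 0, rawLength);
    }
}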

Member

@dnhatn dnhatn left a comment

@parkertimmins Thanks for the extra experiment. As discussed offline, we may need a follow-up after this PR, but the current state looks great.

Member

@martijnvg martijnvg left a comment

Thanks Parker! LGTM.

@parkertimmins parkertimmins added the >non-issue and test-release (Trigger CI checks against release build) labels and removed the release highlight and >feature labels on Nov 18, 2025
@parkertimmins
Contributor Author

A few tests are still failing with test-release, but they are all unrelated to this change:

  • org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapperTests.testKnnQuantizedFlatVectorsFormat
  • org.elasticsearch.xpack.inference.integration.SemanticTextIndexOptionsIT.testValidateIndexOptionsWithBasicLicense (same cause as the checkPart1 failure on this PR)
  • org.elasticsearch.xpack.esql.plugin.IndexResolutionIT.testSubqueryResolution (also failing on this PR)

All logsdb tests are passing, including the rolling upgrade bwc tests. Since that is the case, I'll go ahead with the merge.

@parkertimmins parkertimmins merged commit 15709dd into elastic:main Nov 19, 2025
33 of 38 checks passed
parkertimmins added a commit that referenced this pull request Nov 25, 2025
Binary doc value compression was added behind a feature flag in #137139.
This PR removes the feature flag to enable the feature.
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
…tic#137139)

Add compression for binary doc values using Zstd and blocks with a variable number of values.

Block-wise LZ4 compression for binary doc values was previously added to Lucene in LUCENE-9211. This was subsequently removed in LUCENE-9378 due to query performance issues. We investigated adding the original Lucene implementation to ES in elastic#112416 and elastic#105301. That previous approach used a constant number of values per block (specifically 32). This is nice because it makes it easy to map a given value index (e.g. docId for dense values) to the block containing it with blockId = docId / 32. Unfortunately, if values are very large we cannot reduce the number of values per block, and (de)compressing a block could cause an OOM. And because this is a concern, we have to keep the number of values per block lower than ideal.

This PR instead stores a variable number of documents per block. It stores a minimum of 1 document per block and stops adding values once the size of a block exceeds a threshold. Like the previous version, it stores an array of addresses for the start of each block. Additionally, it stores a parallel array with the value index at the start of each block. When looking up a given value index, if it is not in the current block, we binary search the array of block start value indexes to find the blockId containing the value, then look up the address of that block.
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Nov 26, 2025
…38524)

Binary doc value compression was added behind a feature flag in elastic#137139.
This PR removes the feature flag to enable the feature.

Labels

>non-issue, :StorageEngine/Mapping (The storage related side of mappings), Team:StorageEngine, test-release (Trigger CI checks against release build), v9.3.0
