
Conversation

@parkertimmins (Contributor):
No description provided.

boolean success = false;
try {
    tempOutput = dir.createTempOutput(data.getName(), suffix, context);
    CodecUtil.writeHeader(
@parkertimmins (Contributor, Author):
Does it make sense to add the header/footer and then check the checksum, given that we immediately use and delete the temp file?

@parkertimmins changed the title from "Add binary doc value compression with variably doc count blocks" to "Add binary doc value compression with variable doc count blocks" on Oct 25, 2025.
boolean isBinaryDocValueField(final String field) {
    if (mapperService != null) {
        Mapper mapper = mapperService.mappingLookup().getMapper(field);
        if (mapper != null && "wildcard".equals(mapper.typeName())) {
@parkertimmins (Contributor, Author):
I believe this is not actually how we want to enable this feature, given that we want it for all BinaryDocValues. I'm thinking we need some sort of PerTypeDelegatingDocValueFormat, which would delegate all types to Lucene90DocValuesFormat except binary, which would be delegated to ES819TSDBDocValuesFormat. Thoughts?
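
(A minimal sketch of that idea on the write side, assuming Lucene's DocValuesConsumer API; the class name and wiring are hypothetical, not from this PR:)

import org.apache.lucene.codecs.DocValuesConsumer;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.util.IOUtils;

import java.io.IOException;

// Hypothetical: routes binary fields to the TSDB consumer, everything else
// to the default (Lucene90) consumer.
final class PerTypeDelegatingDocValuesConsumer extends DocValuesConsumer {
    private final DocValuesConsumer defaultConsumer; // e.g. Lucene90DocValuesFormat's
    private final DocValuesConsumer binaryConsumer;  // e.g. ES819TSDBDocValuesFormat's

    PerTypeDelegatingDocValuesConsumer(DocValuesConsumer defaultConsumer, DocValuesConsumer binaryConsumer) {
        this.defaultConsumer = defaultConsumer;
        this.binaryConsumer = binaryConsumer;
    }

    @Override
    public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        binaryConsumer.addBinaryField(field, valuesProducer); // only binary takes the TSDB path
    }

    @Override
    public void addNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultConsumer.addNumericField(field, valuesProducer);
    }

    @Override
    public void addSortedField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultConsumer.addSortedField(field, valuesProducer);
    }

    @Override
    public void addSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultConsumer.addSortedNumericField(field, valuesProducer);
    }

    @Override
    public void addSortedSetField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultConsumer.addSortedSetField(field, valuesProducer);
    }

    @Override
    public void close() throws IOException {
        IOUtils.close(defaultConsumer, binaryConsumer);
    }
}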

Member:
I see this is for using compressed binary doc values outside of logsdb / tsdb use cases. Let's hold off on that for now? Let's just always enable compressed binary doc values if the tsdb doc values codec is used.

I think we should add the compressed binary doc values compression to Lucene90DocValuesFormat in Lucene. In order to do this, I think we need to make compression more pluggable (similar to stored fields), because zstd isn't available in Lucene and I think we need to default to lz4.
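
(For context, stored fields already expose this kind of pluggability through Lucene's CompressionMode; a hypothetical zstd mode built on the ZstdCompressor/ZstdDecompressor classes in this PR could look like the following sketch:)

import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.compressing.Compressor;
import org.apache.lucene.codecs.compressing.Decompressor;

// Hypothetical: mirrors how stored-fields compression is plugged in.
final class ZstdCompressionMode extends CompressionMode {
    @Override
    public Compressor newCompressor() {
        return new ZstdCompressor();   // from this PR
    }

    @Override
    public Decompressor newDecompressor() {
        return new ZstdDecompressor(); // from this PR
    }
}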

@parkertimmins (Contributor, Author):
Ahh gotcha. I had thought we wanted to enable this outside logsdb / tsdb, but yeah, it makes sense that the way to add this outside logsdb is to eventually add it to Lucene.

@martijnvg (Member) left a comment:

Good stuff Parker! I did a first review round.

Additionally, I think we should get real bwc test coverage. We can get that by adding a bwc Java integration test for the wildcard field type, similar to MatchOnlyTextRollingUpgradeIT or TextRollingUpgradeIT.


package org.elasticsearch.index.codec.tsdb;

public enum BinaryDVCompressionMode {
Member:
Maybe make more use of this abstraction here? For example, I think we can add methods (rough sketch below):

  • A method to get the Compressor instance: for NO_COMPRESS this would return null, and for the other mode it would return ZstdCompressor.
  • A minBlockBytes() method that returns CompressedBinaryBlockWriter#MIN_BLOCK_BYTES, or -1 for uncompressed.
  • And perhaps a level() method that returns CompressedBinaryBlockWriter#ZSTD_LEVEL, or -1. But maybe this isn't necessary.
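
(A rough sketch of what those methods could look like, assuming the ZstdCompressor class and the CompressedBinaryBlockWriter constants from this PR:)

public enum BinaryDVCompressionMode {

    NO_COMPRESS((byte) 0),
    COMPRESSED_WITH_ZSTD((byte) 1);

    public final byte code;

    BinaryDVCompressionMode(byte code) {
        this.code = code;
    }

    /** The compressor for this mode, or null when uncompressed. */
    public Compressor compressor() {
        return this == NO_COMPRESS ? null : new ZstdCompressor();
    }

    /** Minimum block size in bytes, or -1 when uncompressed. */
    public int minBlockBytes() {
        return this == NO_COMPRESS ? -1 : CompressedBinaryBlockWriter.MIN_BLOCK_BYTES;
    }

    /** Compression level, or -1 when uncompressed. */
    public int level() {
        return this == NO_COMPRESS ? -1 : CompressedBinaryBlockWriter.ZSTD_LEVEL;
    }
}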

public enum BinaryDVCompressionMode {

    NO_COMPRESS((byte) 0),
    COMPRESSED_WITH_ZSTD((byte) 1);
Member:
Maybe rename to COMPRESSED_WITH_ZSTD_LEVEL_1?

}

-private static final class ZstdDecompressor extends Decompressor {
+public static final class ZstdDecompressor extends Decompressor {
Member:
Maybe move this to be a top-level class in this package?


public static BinaryDVCompressionMode randomBinaryCompressionMode() {
    // BinaryDVCompressionMode[] modes = BinaryDVCompressionMode.values();
    // return modes[random().nextInt(modes.length)];
    return BinaryDVCompressionMode.COMPRESSED_WITH_ZSTD;
Member:
We need to randomize between uncompressed and this one here?
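
(Presumably that just means restoring the commented-out lines:)

public static BinaryDVCompressionMode randomBinaryCompressionMode() {
    BinaryDVCompressionMode[] modes = BinaryDVCompressionMode.values();
    return modes[random().nextInt(modes.length)];
}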

import org.elasticsearch.index.codec.tsdb.BinaryDVCompressionMode;

import java.io.IOException;

Member:
Maybe also update the class-level javadoc to mention that version 1 also includes compressed binary doc values?

* _docId is unused. This is because docId may not be dense.
* But we can guarantee that the lookup value is dense on the range of inserted values.
*/
void addDoc(int _docId, BytesRef v) throws IOException {
Member:
Remove the _docId parameter, given that it is unused?

}
}

public void doAddCompressedBinary(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
Member:
This method is missing the logic for optimized merging? (That logic does exist in doAddUncompressedBinary(...).)

I wonder whether doAddCompressedBinary(...) and doAddUncompressedBinary(...) can be unified here. Maybe add an interface for CompressedBinaryBlockWriter, named something like BinaryWriter, with addDoc(...), flushData() and writeMetadata() methods? For uncompressed, the addDoc(...) impl would delegate to data.writeBytes(...) and the other methods would be no-ops.
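
(A sketch of that interface and its uncompressed implementation; names beyond those in the comment above are illustrative:)

import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.BytesRef;

import java.io.Closeable;
import java.io.IOException;

interface BinaryWriter extends Closeable {
    void addDoc(BytesRef value) throws IOException;

    void flushData() throws IOException;

    void writeMetadata() throws IOException;
}

// Uncompressed impl: addDoc delegates to the data output, the rest are no-ops.
final class UncompressedBinaryWriter implements BinaryWriter {
    private final IndexOutput data;

    UncompressedBinaryWriter(IndexOutput data) {
        this.data = data;
    }

    @Override
    public void addDoc(BytesRef v) throws IOException {
        data.writeBytes(v.bytes, v.offset, v.length);
    }

    @Override
    public void flushData() {} // nothing buffered

    @Override
    public void writeMetadata() {} // no block addresses to record

    @Override
    public void close() {}
}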

 static final int VERSION_START = 0;
-static final int VERSION_CURRENT = VERSION_START;
+static final int VERSION_BINARY_DV_COMPRESSION = 1;
+static final int VERSION_CURRENT = VERSION_BINARY_DV_COMPRESSION;
Member:
+1 to handle bwc via codec version. The alternative is forking the entire codec, but given that the change in the format isn't all over the place, I think this approach is good.
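
(Sketch of what the read-side gate could look like; fromCode is a hypothetical decoder for the mode byte, not from this PR:)

int version = CodecUtil.checkHeader(in, codecName, VERSION_START, VERSION_CURRENT);
BinaryDVCompressionMode mode = version >= VERSION_BINARY_DV_COMPRESSION
    ? BinaryDVCompressionMode.fromCode(in.readByte()) // hypothetical helper
    : BinaryDVCompressionMode.NO_COMPRESS;            // older segments are uncompressed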

}
}

static boolean isDense(int firstDocId, int lastDocId, int length) {
Member:
I think this change is unrelated and was for bulk loading? If so let's undo this for now.
