Add binary doc value compression with variable doc count blocks #137139
base: main
Conversation
Reintroduce the LZ4 binary doc values compression originally added to Lucene in LUCENE-9211, modified so that it works in ES819TSDBDocValuesFormat.
boolean success = false;
try {
    tempOutput = dir.createTempOutput(data.getName(), suffix, context);
    CodecUtil.writeHeader(
Does it make sense to add the header/footer and then check checksum, given that we are immediately using and deleting the temp file?
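For reference, a minimal sketch of the lifecycle in question, assuming a recent Lucene where openChecksumInput takes only the file name; codecName/version and the buffering steps are placeholders, not the PR's actual code:

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

// Illustration only: the temp file is written, read back once, and deleted within the same flush,
// which is what raises the question of whether the header/footer + checksum pass pays for itself.
class TempFileLifecycleSketch {
    static void bufferViaTempFile(Directory dir, String prefix, String suffix, IOContext context, String codecName, int version)
        throws IOException {
        String tempName;
        try (IndexOutput tempOutput = dir.createTempOutput(prefix, suffix, context)) {
            tempName = tempOutput.getName();
            CodecUtil.writeHeader(tempOutput, codecName, version);
            // ... buffer intermediate per-document data into the temp file ...
            CodecUtil.writeFooter(tempOutput);
        }
        // Read back exactly once, verifying header and footer checksum, then delete.
        try (ChecksumIndexInput in = dir.openChecksumInput(tempName)) {
            CodecUtil.checkHeader(in, codecName, version, version);
            // ... stream the buffered data into the real data file ...
            CodecUtil.checkFooter(in);
        }
        dir.deleteFile(tempName);
    }
}
```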
boolean isBinaryDocValueField(final String field) {
    if (mapperService != null) {
        Mapper mapper = mapperService.mappingLookup().getMapper(field);
        if (mapper != null && "wildcard".equals(mapper.typeName())) {
I believe this is not actually how we want to enable this feature, given that we want it for all BinaryDocValues. I'm thinking we need some sort of PerTypeDelegatingDocValueFormat, which would delegate all types to Lucene90DocValuesFormat except binary, which would be delegated to ES819TSDBDocValuesFormat. Thoughts?
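To make the suggestion concrete, a rough sketch of the consumer-side delegation; the class and field names are hypothetical, and a real implementation would still need to sort out file naming and the producer side:

```java
import java.io.IOException;

import org.apache.lucene.codecs.DocValuesConsumer;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.util.IOUtils;

// Hypothetical: routes binary fields to one delegate (e.g. ES819TSDBDocValuesFormat's consumer)
// and every other doc value type to another (e.g. Lucene90DocValuesFormat's consumer).
final class PerTypeDelegatingDocValuesConsumer extends DocValuesConsumer {
    private final DocValuesConsumer binaryDelegate;
    private final DocValuesConsumer defaultDelegate;

    PerTypeDelegatingDocValuesConsumer(DocValuesConsumer binaryDelegate, DocValuesConsumer defaultDelegate) {
        this.binaryDelegate = binaryDelegate;
        this.defaultDelegate = defaultDelegate;
    }

    @Override
    public void addBinaryField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        binaryDelegate.addBinaryField(field, valuesProducer);
    }

    @Override
    public void addNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultDelegate.addNumericField(field, valuesProducer);
    }

    @Override
    public void addSortedField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultDelegate.addSortedField(field, valuesProducer);
    }

    @Override
    public void addSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultDelegate.addSortedNumericField(field, valuesProducer);
    }

    @Override
    public void addSortedSetField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        defaultDelegate.addSortedSetField(field, valuesProducer);
    }

    @Override
    public void close() throws IOException {
        IOUtils.close(binaryDelegate, defaultDelegate);
    }
}
```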
I see this is for using compressed binary doc values outside of logsdb / tsdb use cases. Let's hold off on doing this for now? Let's just always enable compressed binary doc values if the tsdb doc values codec is used.
I think we should add the compressed binary doc values compression to Lucene90DocValuesFormat in Lucene. In order to do this, I think we need to make compression more pluggable (similar to stored fields), because zstd isn't available in Lucene and I think we would need to default to LZ4.
Ahh gotcha. I had thought we wanted to enable this outside logsdb / tsdb, but yeah, it makes sense that the way to add this outside logsdb is to eventually add it to Lucene.
Good stuff Parker! I did a first review round.
Additionally, I think we should get real bwc test coverage. I think we can get that by adding a bwc Java integration test for the wildcard field type, similar to MatchOnlyTextRollingUpgradeIT or TextRollingUpgradeIT.
package org.elasticsearch.index.codec.tsdb;

public enum BinaryDVCompressionMode {
Maybe make more use of this abstraction here? For example I think we can add methods:
- To get the Compressor instance. For NO_COMPRESS this would return null and for the other this would return ZstdCompressor.
- Add minBlockBytes() that returns CompressedBinaryBlockWriter#MIN_BLOCK_BYTES or -1 (for uncompressed).
- And perhaps add a level() method that returns -1 or CompressedBinaryBlockWriter#ZSTD_LEVEL? But maybe this isn't necessary.
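Roughly something along these lines; a sketch only, where the ZstdCompressor constructor, the constant visibility, and the code field name are assumptions rather than the PR's actual code:

```java
import org.apache.lucene.codecs.compressing.Compressor;

public enum BinaryDVCompressionMode {
    NO_COMPRESS((byte) 0),
    COMPRESSED_WITH_ZSTD((byte) 1);

    public final byte code;

    BinaryDVCompressionMode(byte code) {
        this.code = code;
    }

    /** The compressor to use for this mode, or null when values are written uncompressed. */
    public Compressor compressor() {
        // ZstdCompressor and its level argument are assumed from the surrounding PR, not shown here.
        return this == NO_COMPRESS ? null : new ZstdCompressor(level());
    }

    /** Minimum number of buffered bytes before a block is flushed, or -1 when uncompressed. */
    public int minBlockBytes() {
        return this == NO_COMPRESS ? -1 : CompressedBinaryBlockWriter.MIN_BLOCK_BYTES;
    }

    /** Compression level, or -1 when uncompressed. */
    public int level() {
        return this == NO_COMPRESS ? -1 : CompressedBinaryBlockWriter.ZSTD_LEVEL;
    }
}
```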
public enum BinaryDVCompressionMode {

    NO_COMPRESS((byte) 0),
    COMPRESSED_WITH_ZSTD((byte) 1);
Maybe rename to COMPRESSED_WITH_ZSTD_LEVEL_1?
}

-    private static final class ZstdDecompressor extends Decompressor {
+    public static final class ZstdDecompressor extends Decompressor {
Maybe move to be a top level class in this package?
public static BinaryDVCompressionMode randomBinaryCompressionMode() {
    // BinaryDVCompressionMode[] modes = BinaryDVCompressionMode.values();
    // return modes[random().nextInt(modes.length)];
    return BinaryDVCompressionMode.COMPRESSED_WITH_ZSTD;
We need to randomize between uncompressed and this one here?
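Presumably just re-enabling the commented-out lines above, i.e. something like:

```java
public static BinaryDVCompressionMode randomBinaryCompressionMode() {
    // Pick uniformly between NO_COMPRESS and COMPRESSED_WITH_ZSTD.
    BinaryDVCompressionMode[] modes = BinaryDVCompressionMode.values();
    return modes[random().nextInt(modes.length)];
}
```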
import org.elasticsearch.index.codec.tsdb.BinaryDVCompressionMode;

import java.io.IOException;
Maybe also update the class-level javadoc to mention that version 1 also includes compressed binary doc values?
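For example, something along these lines (wording is only a suggestion, and the surrounding description is elided):

```java
/**
 * ... existing class-level description of the ES819 TSDB doc values format ...
 *
 * <p>Version 1 ({@code VERSION_BINARY_DV_COMPRESSION}) additionally supports storing binary doc
 * values in compressed blocks, controlled by {@link BinaryDVCompressionMode}.
 */
```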
 * _docId is unused. This is because docId may not be dense.
 * But we can guarantee that the lookup value is dense on the range of inserted values.
 */
void addDoc(int _docId, BytesRef v) throws IOException {
Remove the _docId parameter, given that it is unused?
    }
}

public void doAddCompressedBinary(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
This method is missing the logic for optimized merging that does exist in doAddUncompressedBinary(...)?
I wonder whether doAddCompressedBinary(...) and doAddUncompressedBinary(...) can be unified here. Maybe add an interface for CompressedBinaryBlockWriter, and name it something like BinaryWriter with addDoc(...), flushData() and writeMetadata() methods? For uncompressed, the addDoc(...) impl would delegate to data.writeBytes(...) and the other methods would be no-ops.
    static final int VERSION_START = 0;
-   static final int VERSION_CURRENT = VERSION_START;
+   static final int VERSION_BINARY_DV_COMPRESSION = 1;
+   static final int VERSION_CURRENT = VERSION_BINARY_DV_COMPRESSION;
+1 to handle bwc via codec version. The alternative is forking the entire codec, but given that the change in the format isn't all over the place, I think this approach is good.
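As a tiny illustration of what the bwc gate might look like on the read side; the helper name is hypothetical, and the enum's byte code field is assumed as in the earlier sketch:

```java
import java.io.IOException;

import org.apache.lucene.store.IndexInput;

class BinaryCompressionBwcSketch {
    static final int VERSION_BINARY_DV_COMPRESSION = 1; // from the diff above

    // Segments written before the version bump carry no compression-mode byte at all
    // and are treated as uncompressed.
    static BinaryDVCompressionMode readBinaryCompressionMode(IndexInput meta, int version) throws IOException {
        if (version < VERSION_BINARY_DV_COMPRESSION) {
            return BinaryDVCompressionMode.NO_COMPRESS;
        }
        byte code = meta.readByte();
        for (BinaryDVCompressionMode mode : BinaryDVCompressionMode.values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalStateException("unknown binary doc values compression mode: " + code);
    }
}
```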
    }
}

static boolean isDense(int firstDocId, int lastDocId, int length) {
I think this change is unrelated and was for bulk loading? If so, let's undo this for now.