Add Apache Arrow format decoder to Pinot #17031
base: master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             master   #17031     +/-   ##
============================================
- Coverage     63.51%   63.49%   -0.03%
  Complexity     1419     1419
============================================
  Files          3082     3082
  Lines        181844   181844
  Branches      27916    27916
============================================
- Hits         115500   115459      -41
- Misses        57458    57501      +43
+ Partials       8886     8884       -2
```
Pull Request Overview
This PR introduces Apache Arrow format decoding support to Pinot, enabling the ingestion of streaming data in Arrow IPC format. The decoder converts Arrow batches into Pinot's GenericRow format, supporting primitive types, complex nested structures (lists, maps, structs), and dictionary encoding for improved data compression.
Key Changes
- Arrow decoder implementation with dictionary encoding support reduces Kafka data volume by 20-80%
- Comprehensive type conversion handling between Arrow and Pinot data types
- Support for nested data structures (lists, maps, structs) with recursive conversion
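The 20-80% volume reduction from dictionary encoding comes from storing each distinct value once and replacing row values with small integer indices. A self-contained sketch (not the PR's code; the value strings and sizes are illustrative assumptions) shows where the savings come from on a low-cardinality column:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    // A column with heavy value repetition, typical of low-cardinality fields.
    List<String> column = List.of(
        "datacenter-us-east-1", "datacenter-us-east-1", "datacenter-us-west-2",
        "datacenter-us-east-1", "datacenter-us-west-2", "datacenter-us-east-1");

    // Plain encoding: every value serialized in full.
    int plainBytes = column.stream()
        .mapToInt(v -> v.getBytes(StandardCharsets.UTF_8).length)
        .sum();

    // Dictionary encoding: unique values stored once, rows become 4-byte indices.
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    int[] indices = new int[column.size()];
    for (int i = 0; i < column.size(); i++) {
      indices[i] = dictionary.computeIfAbsent(column.get(i), k -> dictionary.size());
    }
    int dictBytes = dictionary.keySet().stream()
        .mapToInt(v -> v.getBytes(StandardCharsets.UTF_8).length)
        .sum() + indices.length * Integer.BYTES;

    System.out.println("plain=" + plainBytes + " dict=" + dictBytes);
  }
}
```

Here six 20-byte values shrink from 120 bytes to 64 (two dictionary entries plus six indices); the ratio improves further as repetition grows, which is how the upper end of the quoted range is reached.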
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
File | Description
---|---
`pinot-plugins/pinot-input-format/pom.xml` | Added the pinot-arrow module to the build
`pinot-plugins/pinot-input-format/pinot-arrow/pom.xml` | Defined dependencies on the Apache Arrow libraries (version 18.0.0)
`pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoder.java` | Implements StreamMessageDecoder to decode Arrow IPC messages into GenericRow
`pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToGenericRowConverter.java` | Handles conversion from Arrow VectorSchemaRoot to GenericRow, including type-compatibility handling
`pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/util/ArrowTestDataUtil.java` | Utility for generating test Arrow data with various data types and structures
`pinot-plugins/pinot-input-format/pinot-arrow/src/test/java/org/apache/pinot/plugin/inputformat/arrow/ArrowMessageDecoderTest.java` | Comprehensive test suite covering decoder functionality, data types, and edge cases
```java
private Object flattenArrowMap(MapVector fieldVector, int rowIndex) {
  Map<String, Object> flattened = new LinkedHashMap<>();
  UnionMapReader reader = fieldVector.getReader();
  reader.setPosition(rowIndex);
  while (reader.next()) {
    flattened.put(
        reader.key().readObject().toString(),
        convertArrowTypeToPinotCompatible(reader.value().readObject()));
  }
  return flattened;
}
```
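For readers unfamiliar with `UnionMapReader`, the flattening above amounts to collapsing a sequence of key/value entries into an insertion-ordered map, stringifying keys and passing values through a conversion hook. A hypothetical stand-alone sketch of that shape (the `convert` stand-in is an assumption, not the PR's `convertArrowTypeToPinotCompatible`):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapFlattenSketch {
  // Mirrors flattenArrowMap: stringify each key, convert each value.
  static Map<String, Object> flatten(List<Map.Entry<Object, Object>> entries) {
    Map<String, Object> flattened = new LinkedHashMap<>();
    for (Map.Entry<Object, Object> e : entries) {
      flattened.put(e.getKey().toString(), convert(e.getValue()));
    }
    return flattened;
  }

  // Stand-in for the real Arrow-to-Pinot type conversion; identity here.
  static Object convert(Object value) {
    return value;
  }

  public static void main(String[] args) {
    Map<String, Object> out = flatten(List.of(
        Map.entry((Object) "region", (Object) "us-east"),
        Map.entry((Object) "zone", (Object) 1)));
    System.out.println(out);
  }
}
```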
Copilot AI · Oct 17, 2025
The method `flattenArrowMap` is defined but never called within this class. This is dead code that should either be removed or integrated into the conversion logic if it was intended to handle MapVector types.
```java
private String buildEventLatencyMetricName(String kafkaTopicName) {
  return "ServerMetrics.realtime.arrow.consumer.delay."
      + kafkaTopicName.replace(".", "_");
}
```
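The sanitization this helper performs is simple to exercise on its own: dots in a topic name would otherwise introduce extra levels in a dot-delimited metric hierarchy, so they are replaced with underscores. A stand-alone copy (the example topic name is hypothetical):

```java
public class MetricNameSketch {
  // Same logic as buildEventLatencyMetricName in the PR: dots in the topic
  // name would create extra metric hierarchy levels, so replace them.
  static String buildEventLatencyMetricName(String kafkaTopicName) {
    return "ServerMetrics.realtime.arrow.consumer.delay."
        + kafkaTopicName.replace(".", "_");
  }

  public static void main(String[] args) {
    System.out.println(buildEventLatencyMetricName("orders.v2.events"));
  }
}
```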
Copilot AI · Oct 17, 2025
The method `buildEventLatencyMetricName` is defined but never called. This appears to be dead code that should be removed unless metrics collection is planned for a future implementation.
```java
 * This decoder handles Arrow streaming format and converts Arrow data to Pinot's columnar format.
 */
public class ArrowMessageDecoder implements StreamMessageDecoder<byte[]> {
  public static final String ARROW_SCHEMA_CONFIG = "arrow.schema.config";
```
Copilot AI · Oct 17, 2025
The constant `ARROW_SCHEMA_CONFIG` is defined but never used in this class or the test files. Consider removing this unused constant or documenting its intended future use.
```java
private Set<String> _fieldsToRead;
private RootAllocator _allocator;
private ArrowToGenericRowConverter _converter;

@Override
public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)
    throws Exception {
  _kafkaTopicName = topicName;
  _fieldsToRead = fieldsToRead;
```
Copilot AI · Oct 17, 2025
The field `_fieldsToRead` is assigned in the `init` method but never used elsewhere in the class. Either implement field filtering logic based on this set or remove it if all fields should always be read.
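One way the decoder could put `_fieldsToRead` to use is projecting each decoded row down to the requested columns before emitting it. A hedged sketch of that filtering step on a plain map of column values (the method and treatment of an empty set are assumptions, not the PR's behavior):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FieldFilterSketch {
  // Drop columns that the table schema does not ask for.
  static Map<String, Object> filter(Map<String, Object> row, Set<String> fieldsToRead) {
    if (fieldsToRead == null || fieldsToRead.isEmpty()) {
      return row;  // no projection configured: read everything
    }
    Map<String, Object> projected = new HashMap<>(row);
    projected.keySet().retainAll(fieldsToRead);
    return projected;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>(Map.of("id", 1, "name", "a", "debug", "x"));
    // Only "id" and "name" survive the projection.
    System.out.println(filter(row, Set.of("id", "name")).size());
  }
}
```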
```java
logger.error(
    "Error decoding Arrow message for kafka topic {} : {}",
    _kafkaTopicName,
    Arrays.toString(payload),
    e);
```
Copilot AI · Oct 17, 2025
Logging the full payload with `Arrays.toString(payload)` could expose sensitive data in logs. Consider logging only payload metadata, such as its length or a hash, instead of the actual content.
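One possible shape for that metadata-only log line, as a stand-alone sketch (the helper name and the choice of a truncated SHA-256 digest are suggestions, not the PR's code): the length plus a short digest is usually enough to correlate a failing message with its producer without writing the bytes themselves into the log.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SafePayloadLogging {
  // Describe a payload without exposing its contents.
  static String describe(byte[] payload) throws NoSuchAlgorithmException {
    byte[] hash = MessageDigest.getInstance("SHA-256").digest(payload);
    StringBuilder hex = new StringBuilder();
    for (int i = 0; i < 4; i++) {  // first 4 digest bytes are plenty for correlation
      hex.append(String.format("%02x", hash[i] & 0xff));
    }
    return "len=" + payload.length + " sha256[0:4]=" + hex;
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(describe("hello".getBytes(StandardCharsets.UTF_8)));
  }
}
```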
```xml
<properties>
  <pinot.root>${basedir}/../../..</pinot.root>
  <shade.phase.prop>package</shade.phase.prop>
  <arrow.version>18.0.0</arrow.version>
```
The Pinot root POM already has an Arrow version defined; reuse that property instead of redefining it in this module.
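If the root POM exposes that version as a Maven property (the property name `arrow.version` is an assumption here), the module's `<properties>` block can drop its local definition and the dependencies simply inherit the parent's value:

```xml
<!-- pinot-arrow/pom.xml: no local <arrow.version>; the property resolves from the Pinot root POM -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>${arrow.version}</version>
</dependency>
```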
overall lgtm
```java
  return row;
} catch (Exception e) {
  logger.error(
      "Error decoding Arrow message for kafka topic {} : {}",
```
This decoder is not specific to Kafka, so the log message (and the `_kafkaTopicName` naming) should refer to the stream topic generically.
Labels: feature, ingestion
Issue: #16643
Add the initial version of the Arrow decoder to Pinot. With this decoder, Pinot can decode stream data in the basic Apache Arrow format. This is part of the first-stage delivery of the proposal above.
Performance and Improvements:
Some limitations and TODOs: