[C++] Import of Map type via the C Data interface drops child field metadata #44714

Open
paleolimbot opened this issue Nov 13, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

When Map types are received via the C Data interface, field metadata (including extension metadata) is dropped. This seems unintentional given that we maintain that metadata for a list of structs:

import duckdb

duckdb_cursor = duckdb.connect()
duckdb_cursor.execute("SET arrow_lossless_conversion = true")
arrow_table = duckdb_cursor.execute("select map {uuid(): 1::uhugeint, uuid(): 2::uhugeint} as li").arrow()
res = duckdb_cursor.execute("select typeof(li) FROM arrow_table").fetchall()
print("map type")
print(arrow_table.schema)
print(res)
# map type
# li: map<fixed_size_binary[16], fixed_size_binary[16]>
#   child 0, entries: struct<key: fixed_size_binary[16] not null, value: fixed_size_binary[16]> not null
#       child 0, key: fixed_size_binary[16] not null
#       child 1, value: fixed_size_binary[16]
# [('MAP(BLOB, BLOB)',)]

arrow_table = duckdb_cursor.execute("select [{'keys': uuid(), 'values': uuid()}] as li").arrow()
res = duckdb_cursor.execute("select typeof(li) FROM arrow_table").fetchall()
print("list of structs type")
print(arrow_table.schema)
print(res)
# list of structs type
# li: list<l: struct<keys: fixed_size_binary[16], values: fixed_size_binary[16]>>
#   child 0, l: struct<keys: fixed_size_binary[16], values: fixed_size_binary[16]>
#       child 0, keys: fixed_size_binary[16]
#       -- field metadata --
#       ARROW:extension:metadata: ''
#       ARROW:extension:name: 'arrow.uuid'
#       child 1, values: fixed_size_binary[16]
#       -- field metadata --
#       ARROW:extension:metadata: ''
#       ARROW:extension:name: 'arrow.uuid'
# [('STRUCT(keys UUID, "values" UUID)[]',)]

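This occurs because, to canonicalize the field names, we reconstruct the key and value fields from only their types (and nullability), which discards any metadata attached to the imported fields: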

Status ProcessMap() {
  RETURN_NOT_OK(f_parser_.CheckAtEnd());
  RETURN_NOT_OK(CheckNumChildren(1));
  ARROW_ASSIGN_OR_RAISE(auto field, MakeChildField(0));
  const auto& value_type = field->type();
  if (value_type->id() != Type::STRUCT) {
    return Status::Invalid("Imported map array has unexpected child field type: ",
                           field->ToString());
  }
  if (value_type->num_fields() != 2) {
    return Status::Invalid("Imported map array has unexpected child field type: ",
                           field->ToString());
  }
  bool keys_sorted = (c_struct_->flags & ARROW_FLAG_MAP_KEYS_SORTED);
  bool values_nullable = value_type->field(1)->nullable();
  // Some implementations of Arrow (such as Rust) use a non-standard field name
  // for key ("keys") and value ("values") fields. For simplicity, we override
  // them on import.
  auto values_field =
      ::arrow::field("value", value_type->field(1)->type(), values_nullable);
  type_ = map(value_type->field(0)->type(), values_field, keys_sorted);
  return Status::OK();
}

I don't think we have that problem in the IPC type conversion, which renames the existing child fields with WithName() instead of rebuilding them:

*out = std::make_shared<MapType>(children[0]->type()->field(0)->WithName("key"),
                                 children[0]->type()->field(1)->WithName("value"),
                                 map->keysSorted());
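
To make the difference between the two snippets above concrete, here is a small self-contained toy model. It does not use Arrow's classes: `ToyField`, `MakeImportedUuidField`, `RebuildFromType`, and `RenameOnly` are hypothetical names invented for this illustration, and the type is just a string. It only shows why rebuilding a field from its type loses the metadata while renaming a copy of the field keeps it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for a field. This is NOT arrow::Field; it carries just enough
// state (name, type, nullability, metadata) to illustrate the issue.
struct ToyField {
  std::string name;
  std::string type;  // type rendered as text, e.g. "fixed_size_binary[16]"
  bool nullable;
  std::map<std::string, std::string> metadata;

  // Copy the field and change only its name (the idea behind WithName()).
  ToyField WithName(std::string new_name) const {
    ToyField copy = *this;
    copy.name = std::move(new_name);
    return copy;
  }
};

// Hypothetical helper: a field as a producer might export it, with a
// non-canonical name and the arrow.uuid extension metadata from the report.
ToyField MakeImportedUuidField(const std::string& name) {
  return ToyField{name, "fixed_size_binary[16]", true,
                  {{"ARROW:extension:name", "arrow.uuid"},
                   {"ARROW:extension:metadata", ""}}};
}

// Strategy used by the C Data import path quoted above: build a fresh field
// from the type and nullability only. The metadata is never copied.
ToyField RebuildFromType(const ToyField& imported,
                         const std::string& canonical_name) {
  assert(!canonical_name.empty());
  return ToyField{canonical_name, imported.type, imported.nullable, {}};
}

// Strategy used by the IPC path quoted above: keep the field, change its name.
ToyField RenameOnly(const ToyField& imported,
                    const std::string& canonical_name) {
  assert(!canonical_name.empty());
  return imported.WithName(canonical_name);
}
```

Both strategies yield a field with the canonical name, but only `RenameOnly` still carries the `ARROW:extension:name` entry afterwards. Under the assumption that the real fix follows the IPC approach, the same effect would come from passing renamed fields to the map type instead of bare types.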

Component(s)

C++
