Skip to content

feat: support union type for basic types #510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 114 commits into from
Jun 21, 2025

Conversation

chardoncs
Copy link
Contributor

@chardoncs chardoncs commented May 18, 2025

Here is my solution on the union type support. It includes basic conversions and representations for the union type.

If there are any problems with the change, feel free to speak out for revision or reject.

Context

This PR is an attempt to solve Issue #436. It includes descriptions and conceptual works about union type support.

Changes

  • Added union in the basic value type, including a new util struct UnionType.
    • [Breaking] Added the union option to the BasicValueType.
    • UnionType includes a BTreeSet to store the types, with extra features.
      • Deduplication: E.g., int | str | int is int | str.
      • Auto-sorting. The current sorting strategy is using the same order of the definition in BasicValueType. It might be > changed to explicit ranks.
      • Auto-flattening: E.g., int | (int | str) | str | float is int | str | float
      • If there is only one type, the parser returns the exact type.
      • String guessing: If some of UUID, DateTime (LocalDateTime, OffsetDateTime, Date, and Time), JSON, and String appear in a union at the same time, the parser will try parsing the value against each possible type in the reversed order of the definition (might be changed to explicit ranks).
  • Union type conversion for Qdrant.
  • Union type conversion for PostgreSQL.
  • JSON Schema for union type, using oneOf.
  • Format display: Union[type1 | type2 | ...].
  • [Breaking] Added union support for type encoder in Python API.~

@chardoncs
Copy link
Contributor Author

I just added some test cases and fixed some bugs.

There are some problems with datetime conversions.

@chardoncs
Copy link
Contributor Author

I have updated the test cases, and it works now.

Now I'm trying to figure out if this implementation works with databases (i.e., Qdrant, Postgres, Neo4j, and Kuzu). My concern right now is especially on Kuzu, because it has a native C-like union type, which functions very different from the others.

@badmonster0
Copy link
Member

My concern right now is especially on Kuzu, because it has a native C-like union type, which functions very different from the others.

That's a fair concern. I had similar concern, e.g. users need to access using tags like val0, val1 etc. which is a little bit arbitrary. I think probably we want to store it as JSON in Kuzu, similar to Postgres. But JSON requires an extension on Kuzu (https://docs.kuzudb.com/extensions/). I think probably you can skip supporting Kuzu for this PR (similar to the Json type, which is also unsupported for Kuzu yet). Later I'll add JSON extension support for Kuzu which will make us easily support union type for it too.

@chardoncs
Copy link
Contributor Author

I just checked current implementation.

I have removed union implementation for Kuzu. And it is JSON for Postgres and Qdrant, but I have introduced tag_id back for JSON conversion, because otherwise the full roundtrip tests would fail. However, there is no tag_id for Neo4j.

@badmonster0
Copy link
Member

I just checked current implementation.

I have removed union implementation for Kuzu. And it is JSON for Postgres and Qdrant, but I have introduced tag_id back for JSON conversion, because otherwise the full roundtrip tests would fail. However, there is no tag_id for Neo4j.

Yes, we should have tag_id for the the normal serialization.

For Postgres, we're using TypedValue to serialize it. As I explained in another comment thread (you can search for "TypedValue", the thread has more context about the specific consideration/context for TypedValue), serialization for TypedValue needs to be updated to drop the tag_id for union types. Hence Postgres won't have tag_id in its JSON.

Copy link
Member

@badmonster0 badmonster0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also run ./check.sh, which will run autoformat and will resolve github workflow check errors.

@badmonster0
Copy link
Member

Thanks a lot for pushing through this feature, and all revisions along the way!

@badmonster0 badmonster0 merged commit 40e84b6 into cocoindex-io:main Jun 21, 2025
5 checks passed
@chardoncs chardoncs deleted the expr-union-type-impl branch June 21, 2025 02:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants