          feat(go/adbc/driver/bigquery): add BIGQUERY:type field metadata
          #3604
              Conversation
Motivation

The `Type` metadata key has two limitations which stem from BigQuery's API:

1. it says fields of type `ARRAY<T>` are just `T` with `Repeated=true`
2. it says `STRUCT<...>` fields are simply `RECORD`, and erases any information about the inner fields

These limitations can cause problems when trying to parse the `Type` key or when using it verbatim against the warehouse in a statement, e.g. a `CREATE TABLE` statement or an `AS T` cast.

Summary

This PR adds a new `BIGQUERY:type` key that formats the original SQL type string as specified by BigQuery. Most types remain unchanged as they come from `gobigquery`, and in those cases this key will contain the same value as `Type`. However, arrays and structs get transformed to match the richer type string.
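The array/struct rewriting described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code; the `FieldSchema` struct and `typeString` helper here are hypothetical stand-ins mirroring the shape of the BigQuery field schema:

```go
package main

import (
	"fmt"
	"strings"
)

// FieldSchema is a hypothetical stand-in for the schema the BigQuery
// API returns: ARRAY<T> is flattened to T with Repeated=true, and
// STRUCT<...> is flattened to "RECORD" with the members in Fields.
type FieldSchema struct {
	Name     string
	Type     string // e.g. "INTEGER", "STRING", "RECORD"
	Repeated bool
	Fields   []*FieldSchema // populated when Type == "RECORD"
}

// typeString reconstructs the richer SQL type string:
// RECORD becomes STRUCT<...> and Repeated wraps the type in ARRAY<...>.
func typeString(s *FieldSchema) string {
	t := s.Type
	if t == "RECORD" {
		parts := make([]string, len(s.Fields))
		for i, f := range s.Fields {
			parts[i] = f.Name + " " + typeString(f)
		}
		t = "STRUCT<" + strings.Join(parts, ", ") + ">"
	}
	if s.Repeated {
		t = "ARRAY<" + t + ">"
	}
	return t
}

func main() {
	f := &FieldSchema{
		Name: "tags", Type: "RECORD", Repeated: true,
		Fields: []*FieldSchema{
			{Name: "key", Type: "STRING"},
			{Name: "values", Type: "INTEGER", Repeated: true},
		},
	}
	fmt.Println(typeString(f)) // ARRAY<STRUCT<key STRING, values ARRAY<INTEGER>>>
}
```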
Hmmm seeing some Python failures in CI, not sure how they're related?
```go
metadata["Repeated"] = strconv.FormatBool(schema.Repeated)
metadata["Required"] = strconv.FormatBool(schema.Required)
field.Nullable = !schema.Required
metadata["Type"] = string(schema.Type)
```
Do we want to keep this? Should we rename it to something like BIGQUERY:simple_type?
I was wondering the same thing. I thought of keeping it as-is to avoid breaking changes to end users. But if the project maintainers are fine with a breaking change I can do it!
I would personally prefer we namespace all the properties now that we want to introduce this convention. Possibly we can keep the existing properties under their current name with a deprecation notice.
@lidavidm I pushed a new commit that does that. I called it BIGQUERY:raw_type since it's the "raw" unprocessed thing coming directly from the API. I think this is a bit more descriptive than simple_type.
I am not sure if this should be split across two PRs though. IMO these should be two separate changelog entries: one for standardizing the keys under BIGQUERY:... and another for adding the new rich type key.
If that's the case I can merge the last commit first in a separate PR, then rebase this one on top of that.
@lidavidm I think the convention in the BigQuery driver is that all the fields in the JSON response are copied as-is into the Arrow field metadata.
That's not a law of physics, though. We can change things.
If you want to defer this to a separate PR, that's fine by me. But I think they should be consistent.
@serramatutu can you move the last commit to a separate PR? So we can merge this one with just the BIGQUERY:type addition.
I just removed the last commit from this branch. I have it on a separate branch and I can open a followup PR after this one to standardize all keys.
Force-pushed from c87220a to bb7440d
Merging because the failing checks are Meson+PG and CMake specific.
Testing

I ran a `CREATE TABLE AS` query against BigQuery. Here's the result for fields of different types:

1. Regular non-nested types are simply copied over from the value of `Type`
2. An array of integers becomes `ARRAY<INTEGER>`, while `Type` remains `INTEGER`
3. An array of structs becomes `ARRAY<STRUCT<...>>`
4. A struct of arrays' inner types are `ARRAY<...>`
5. A deeply nested struct also has the correct inner types
Related issues