-
Notifications
You must be signed in to change notification settings - Fork 165
feat(c/include/arrow-adbc): Add AdbcStatementRequestSchema #3623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for this! I wonder if we also want to somehow allow things like "read 'field_name' as string view, but leave other fields alone" or "read all 'database_type' as string view, but leave other fields alone" |
Sure, or some version of an option get/set with of a list of supported or preferred types. Happy to take out the negotiation bit if that's confusing (the main purpose here is to help the driver make better choices for loose or row-based typing). |
|
This looks pretty good. On one hand, it's not the most ergonomic way to handle this but it's the most regular with the existing API. Am I right that the most convenient way to use this API may be to first call Another question, should it be spelled out that the schema can be partial or not? |
|
I think the reason for adding this would be for a user who executed and collected a query but got an error along the lines of "can't append string to column of type int64", which is what would happen today if someone issued a query with the sqllite driver and first few thousand values were null. The second reason is for people who already know the arrow type they want, perhaps because they're using a tool like ibis that already knows or because they executed the query and the driver guessed something they didn't like or that was inconsistent with a previous query against the same data. I should probably take out the negotiation bit since it seems to be distracting from the intent...it's orthogonal to options, which would only affect the default guessing (but still have no way to specifically specify type parameters or specific problematic columns). (These are all roughly the same reason why a requested schema is provided for a csv reader.) |
|
two questions:
|
|
Wouldn't you just call this in between NextResultSetSchema and NextResultSet? We'd just generalize NextResultSetSchema. |
|
Mostly I'm just trying to cut down the number of extra functions to create. If the idea is that we wouldn't need to create a new function to handle multiple result sets, then I'm okay with that. And that just leaves my first question |
Yes, it's intended as guidance where a driver would otherwise have to guess. I wrote it up here as best-effort (same as PyCapsule |
|
Well, maybe we want to cover cases like explicitly choosing between string/large_string/string_view (in principle the driver would have picked one, but maybe the user wants one of the other types and this can avoid a later conversion). Or is the thought that we need to add driver-specific flags for all of that? @CurtHagenlocher had commented before that with JDBC/ODBC, you get to pick the type you want when you get the data - that's obviously too much flexibility for an Arrow stream, but I don't see why we should limit this to only cases where it is ambiguous to the driver. (In the first place, the underlying data isn't Arrow much of the time; even if it is, ideally the request would be pushed back to the backend.) The driver could certainly ignore requests to cast, of course, especially if it's actually just going to be a cast internally. |
|
I can add that bit back in too...whatever is less confusing! We can also experiment with schema negotiation afterward (I don't think it affects the behaviour of the function but I could be missing something). I think options are orthogonal...options affect the guessing of the schema in the absence of a request and we could have those, too. |
This PR proposes a 1.2 spec addition
AdbcStatementRequestSchema(). There are a few decisions here and while I put specific language in I think any version of these is still helpful:requested_schema). I think this is better for negotiation (e.g., the caller calls execute schema, requests a schema with all strings/binary/list columns as views, and the driver produces string/binary views but not list views because it doesn't implement those). A strict version is also useful and might generate better errors.struct ArrowSchema*should beconst ArrowSchema*. I'm not sure this matters (mostly the C struct will be populated specifically for this call and they are easy to copy).RequestSchemaas the name if there's a better oneThe capability to do this is built in to the Postgres driver already for some types (the COPY reader was designed to always accept an arbitrary ArrowSchema, which we currently just infer before constructing the reader).
Other 1.2 spec additions in #3607 (which also contains the infrastructure for a new ADBC minor version).
Closes #1514 .