Skip to content

Conversation

@paleolimbot
Copy link
Member

This PR proposes a 1.2 spec addition AdbcStatementRequestSchema(). There are a few decisions here and while I put specific language in I think any version of these is still helpful:

  • I wrote this up as this best-effort (like the PyCapsule requested_schema). I think this is better for negotiation (e.g., the caller calls execute schema, requests a schema with all strings/binary/list columns as views, and the driver produces string/binary views but not list views because it doesn't implement those). A strict version is also useful and might generate better errors.
  • Maybe struct ArrowSchema* should be const ArrowSchema*. I'm not sure this matters (mostly the C struct will be populated specifically for this call and they are easy to copy).
  • No opinions on RequestSchema as the name if there's a better one

The capability to do this is built in to the Postgres driver already for some types (the COPY reader was designed to always accept an arbitrary ArrowSchema, which we currently just infer before constructing the reader).

Other 1.2 spec additions in #3607 (which also contains the infrastructure for a new ADBC minor version).

Closes #1514 .

@lidavidm
Copy link
Member

Thanks for this!

I wonder if we also want to somehow allow things like "read 'field_name' as string view, but leave other fields alone" or "read all 'database_type' as string view, but leave other fields alone"

@paleolimbot
Copy link
Member Author

I wonder if we also want to somehow allow things like "read 'field_name' as string view, but leave other fields alone" or "read all 'database_type' as string view, but leave other fields alone"

Sure, or some version of an option get/set with of a list of supported or preferred types. Happy to take out the negotiation bit if that's confusing (the main purpose here is to help the driver make better choices for loose or row-based typing).

@lidavidm lidavidm requested review from amoeba and zeroshade November 11, 2025 01:16
@amoeba
Copy link
Member

amoeba commented Nov 11, 2025

This looks pretty good. On one hand, it's not the most ergonomic way to handle this but it's the most regular with the existing API. Am I right that the most convenient way to use this API may be to first call AdbcStatementExecuteSchema (and hope it's implemented), then modify the returned schema to my desire, call AdbcStatementRequestSchema with the modified schema, and then execute my statement?

Another question, should it be spelled out that the schema can be partial or not?

@paleolimbot
Copy link
Member Author

I think the reason for adding this would be for a user who executed and collected a query but got an error along the lines of "can't append string to column of type int64", which is what would happen today if someone issued a query with the sqllite driver and first few thousand values were null.

The second reason is for people who already know the arrow type they want, perhaps because they're using a tool like ibis that already knows or because they executed the query and the driver guessed something they didn't like or that was inconsistent with a previous query against the same data.

I should probably take out the negotiation bit since it seems to be distracting from the intent...it's orthogonal to options, which would only affect the default guessing (but still have no way to specifically specify type parameters or specific problematic columns).

(These are all roughly the same reason why a requested schema is provided for a csv reader.)

@zeroshade
Copy link
Member

two questions:

  1. Is the expectation that the driver itself would cast/convert the types as they come back from the source?
  2. Since we're also working on and discussing handling multiple result sets, how would this work/handle the case of multiple result sets? Would we need an entirely new function once we add support for multiple result sets to handle this for that case? Is there any way we can adapt this function so that it can work in both cases somehow instead of needing to create an entirely new function afterwards?

@lidavidm
Copy link
Member

Wouldn't you just call this in between NextResultSetSchema and NextResultSet? We'd just generalize NextResultSetSchema.

@zeroshade
Copy link
Member

Mostly I'm just trying to cut down the number of extra functions to create. If the idea is that we wouldn't need to create a new function to handle multiple result sets, then I'm okay with that. And that just leaves my first question

@paleolimbot
Copy link
Member Author

Is the expectation that the driver itself would cast/convert the types as they come back from the source?

Yes, it's intended as guidance where a driver would otherwise have to guess. I wrote it up here as best-effort (same as PyCapsule requested_schema)...the other choice would be that the driver is required to produce that schema. I don't happen think that's particularly useful (whatever schema is produced by the output stream is always the stream that should be used to interpret the data), but for the immediate use cases (SQLite and Postgres drivers) it would also not be a problem.

@lidavidm
Copy link
Member

Well, maybe we want to cover cases like explicitly choosing between string/large_string/string_view (in principle the driver would have picked one, but maybe the user wants one of the other types and this can avoid a later conversion). Or is the thought that we need to add driver-specific flags for all of that? @CurtHagenlocher had commented before that with JDBC/ODBC, you get to pick the type you want when you get the data - that's obviously too much flexibility for an Arrow stream, but I don't see why we should limit this to only cases where it is ambiguous to the driver. (In the first place, the underlying data isn't Arrow much of the time; even if it is, ideally the request would be pushed back to the backend.)

The driver could certainly ignore requests to cast, of course, especially if it's actually just going to be a cast internally.

@paleolimbot
Copy link
Member Author

I can add that bit back in too...whatever is less confusing! We can also experiment with schema negotiation afterward (I don't think it affects the behaviour of the function but I could be missing something). I think options are orthogonal...options affect the guessing of the schema in the absence of a request and we could have those, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add the ability to request a schema from a statement

4 participants