
chore: use faster query_and_wait API in _read_gbq_colab #1777


Merged: 24 commits into main from b405372623-read_gbq-allow_large_results, Jun 3, 2025

Conversation

@tswast (Collaborator) commented May 28, 2025

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue b/405372623 🦕
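
For readers unfamiliar with the API in the title: google-cloud-bigquery's Client.query_and_wait issues the query and waits for its results in a single call, and for short queries it can use the jobs.query REST endpoint to skip creating a job resource entirely. A minimal sketch (placeholder query, not this PR's code) contrasting it with the older two-step pattern:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT 1 AS x"  # placeholder query

# Older two-step pattern: create a job resource, then poll until done.
job = client.query(sql)
rows = job.result()

# query_and_wait: one call that returns rows directly and may skip
# job creation for short queries, saving round trips.
rows = client.query_and_wait(sql)
for row in rows:
    print(row["x"])
```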

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels May 28, 2025
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels May 29, 2025
@tswast tswast marked this pull request as ready for review May 29, 2025 22:55
@tswast tswast requested review from a team as code owners May 29, 2025 22:55
@tswast tswast requested a review from shobsi May 29, 2025 22:55
Comment on lines 870 to 872
# This is somewhat wasteful, but we convert from Arrow to pandas
# to try to duplicate the same dtypes we'd have if this were a
# table node as best we can.
Contributor:

Hmm, ideally, we would not make this round trip. We should be able to convert directly to a managed storage table, but we need to override some default type inferences (notably, need to specify geo, json or they will infer as string). Constructor: https://github.com/googleapis/python-bigquery-dataframes/blob/main/bigframes/core/local_data.py#L85

Collaborator Author:

Fixed in 12fd221
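
A rough illustration of the direct path suggested above, assuming nothing about the ManagedArrowTable constructor itself: derive an explicit Arrow schema from the BigQuery result schema and cast to it, so GEOGRAPHY and JSON columns (which arrive over the wire as strings) are chosen deliberately rather than re-inferred.

```python
import pyarrow as pa

# Illustrative mapping only; how bigframes actually represents GEOGRAPHY
# and JSON internally is not shown here (both arrive as strings).
_BQ_TO_ARROW = {
    "INTEGER": pa.int64(),
    "FLOAT": pa.float64(),
    "STRING": pa.string(),
    "GEOGRAPHY": pa.string(),
    "JSON": pa.string(),
}

def cast_to_bq_schema(table: pa.Table, bq_types: dict) -> pa.Table:
    """Cast an Arrow table using types derived from the BigQuery schema,
    tagging each field so geo/json columns stay distinguishable."""
    fields = [
        pa.field(
            name,
            _BQ_TO_ARROW[bq_types[name]],
            metadata={"bq_type": bq_types[name]},
        )
        for name in table.column_names
    ]
    return table.cast(pa.schema(fields))
```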

@tswast tswast requested a review from TrevorBergeron May 30, 2025 20:05
Comment on lines +94 to +97
# Not every job populates these. For example, slot_millis is missing
# from queries that came from cached results.
bytes_processed if bytes_processed else 0,
slot_millis if slot_millis else 0,
Contributor:

I do slightly worry about these zeros being misinterpreted if we ever summarize them into averages, but I guess it's not a problem for now.

Collaborator Author:

In some cases (cached query) 0 probably makes sense for the average. I agree that it could be misleading, though.
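
One hypothetical way to avoid that ambiguity later (not what this PR does) would be to keep unreported metrics as None and let summaries skip them:

```python
from statistics import mean
from typing import List, Optional

def average_slot_millis(values: List[Optional[int]]) -> Optional[float]:
    # Skip unreported values rather than counting them as zero, so a
    # cache hit doesn't drag the average down.
    reported = [v for v in values if v is not None]
    return mean(reported) if reported else None

print(average_slot_millis([1200, None, 800]))  # 1000, not 666.7
```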

# If there was no destination table and we've made it this far, that
# means the query must have been DDL or DML. Return some job metadata,
# instead.
if not destination:
Contributor:

this function is getting absurdly long, maybe we can pull this job -> stats_df out

Collaborator Author:

Good idea. That is a good candidate to split out. I think I can split out a few others as well, such as RowIterator -> DataFrame.
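
A rough sketch of the kind of helper being discussed; the function name and the chosen fields are illustrative, not the module's actual code:

```python
import pandas as pd
from google.cloud import bigquery

def _job_to_stats_df(query_job: bigquery.QueryJob) -> pd.DataFrame:
    """Summarize a DDL/DML job (no destination table) as one row of metadata."""
    return pd.DataFrame(
        {
            "statement_type": [query_job.statement_type],
            "job_id": [query_job.job_id],
            "location": [query_job.location],
        }
    )
```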

Comment on lines 894 to 895
array_value = core.ArrayValue.from_managed(mat, self._session)
array_with_offsets, offsets_col = array_value.promote_offsets()
Contributor:

The promote_offsets makes sense, the only problem is that this will invalidate the local engines until I implement that node type. Might be better, if messier, just to manually add offsets to the pyarrow table for now?

Collaborator Author:

Makes sense. I can do that. I wonder why the test_read_gbq_colab_repr_avoids_requery test in tests/system/small/session/test_read_gbq_colab.py wasn't failing? Maybe because we do an upload and then a download instead of running a query?

Edit: I think it's because I wasn't testing with google-cloud-bigquery 3.34.0 (which includes googleapis/python-bigquery#2190), nor with the query preview environment variable set.

Edit2: I did still have to add support for slice to the local executor because of repr doing head, but I suspect that's still easier than implementing promote_offsets.
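
The manual approach suggested above amounts to appending a sequential int64 column to the Arrow table; a minimal sketch, with a placeholder column name:

```python
import pyarrow as pa

def append_offsets(table: pa.Table, name: str = "_row_offset") -> pa.Table:
    # Sequential row numbers, equivalent to what promote_offsets provides.
    offsets = pa.array(range(table.num_rows), type=pa.int64())
    return table.append_column(name, offsets)

t = pa.table({"x": ["a", "b", "c"]})
print(append_offsets(t).column("_row_offset").to_pylist())  # [0, 1, 2]
```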

) -> Tuple[google.cloud.bigquery.table.RowIterator, Optional[bigquery.QueryJob]]:
...

def _start_query(
Contributor:

maybe two separate methods would be less code than working around all the mypy nonsense to reuse the same symbol?

Collaborator Author (@tswast, Jun 2, 2025):

I can give it a try. Note that start_query_with_client already has this parameter though, so we'd end up with some inconsistency there.

Edit: Done in da2bf08
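
For context, the "mypy nonsense" here is the typing.overload dance needed when one method's return type depends on a flag. A generic illustration of the trade-off being weighed (class and method names hypothetical):

```python
from typing import Literal, Optional, Tuple, overload

from google.cloud import bigquery
from google.cloud.bigquery.table import RowIterator

class WithOverloads:
    # One symbol, but three signatures to keep mypy happy.
    @overload
    def _start_query(
        self, sql: str, *, create_job: Literal[True]
    ) -> Tuple[RowIterator, bigquery.QueryJob]: ...

    @overload
    def _start_query(
        self, sql: str, *, create_job: Literal[False] = ...
    ) -> Tuple[RowIterator, Optional[bigquery.QueryJob]]: ...

    def _start_query(self, sql, *, create_job=False):
        raise NotImplementedError

class TwoMethods:
    # The alternative: two plainly-typed methods, no overloads needed.
    def _start_query(
        self, sql: str
    ) -> Tuple[RowIterator, Optional[bigquery.QueryJob]]:
        raise NotImplementedError

    def _start_query_and_job(
        self, sql: str
    ) -> Tuple[RowIterator, bigquery.QueryJob]:
        raise NotImplementedError
```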

@@ -34,7 +34,7 @@ def execute(
if not node:
return None

# TODO: Can support some slicing, sorting
# TODO: Can support some sorting
Collaborator Author:

TODO(tswast): Add unit tests for new slice support.

Collaborator Author:

Done.
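
The slice support mentioned in Edit2 above boils down to mapping head(n) onto pyarrow's zero-copy Table.slice; a simplified sketch, not the executor's actual code:

```python
from typing import Optional

import pyarrow as pa

def slice_table(
    table: pa.Table, start: int = 0, stop: Optional[int] = None
) -> pa.Table:
    # Normalize bounds, then take a zero-copy slice of the local table.
    stop = table.num_rows if stop is None else min(stop, table.num_rows)
    start = min(start, table.num_rows)
    return table.slice(start, max(stop - start, 0))

t = pa.table({"x": list(range(10))})
print(slice_table(t, stop=5).num_rows)  # 5, like DataFrame.head(5)
```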

@tswast tswast requested a review from TrevorBergeron June 3, 2025 16:21
TrevorBergeron previously approved these changes Jun 3, 2025
@tswast (Collaborator Author) commented Jun 3, 2025

The e2e failure for isdigit in the prerelease tests is tracked in b/333484335. It's due to a fix in pyarrow that bigframes hasn't yet been updated to emulate.

@tswast tswast merged commit f495c84 into main Jun 3, 2025
19 of 24 checks passed
@tswast tswast deleted the b405372623-read_gbq-allow_large_results branch June 3, 2025 21:43