Conversation

dborowitz

Description

Add a new source serverless-spark for connecting to Google Cloud Serverless for Apache Spark, along with a single simple tool serverless-spark-list-batches.
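For concreteness, a `tools.yaml` entry for the new source and tool might look roughly like this; the `kind` values match the names proposed in this PR, but the remaining field names (`project`, `location`) are a sketch of the likely shape, not the final schema:

```yaml
sources:
  my-serverless-spark:
    kind: serverless-spark
    project: my-project
    location: us-central1

tools:
  list_batches:
    kind: serverless-spark-list-batches
    source: my-serverless-spark
    description: Lists Serverless Spark batches in the project.
```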

One outstanding though possibly trivial question is how to address naming/branding/source splitting. The official product name is "Google Cloud Serverless for Apache Spark", though this is a relatively recent rebranding of "Dataproc Serverless". I figured for the source and tool names we could use the short version "serverless-spark", but I've tried to retain the officially correct name in documentation. Granted, it's quite a mouthful and sticks out compared to the other names in Toolbox. On the other hand, I don't want to imply that this works with any Spark hosting environment besides Google's. I'll double-check the branding guidelines with our PMs, but I appreciate Toolbox maintainers' feedback as well.

PR Checklist

Thank you for opening a Pull Request! Before submitting your PR, there are a
few things you can do to make sure it goes smoothly:

  • Make sure you reviewed
    CONTRIBUTING.md
  • Make sure to open an issue as a
    bug/issue
    before writing your code! That way we can discuss the change, evaluate
    designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

🛠️ Part of #1689

@dborowitz dborowitz requested a review from a team as a code owner October 10, 2025 21:31
@dborowitz dborowitz force-pushed the serverless-spark branch 2 times, most recently from 2e450ef to fad52b7 Compare October 13, 2025 16:30
Contributor

@duwenxin99 duwenxin99 left a comment


Hi @dborowitz, thanks for the PR! The source and tool implementations overall look good, but we need more thorough integration tests to make sure things are working correctly in the long term.

```go
// or failed Serverless Spark batches, of any age.
func runListBatchesTest(t *testing.T) {
	requestBody := bytes.NewBuffer([]byte(`{"pageSize": 2, "filter": "state = SUCCEEDED OR state = FAILED"}`))
	req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:5000/api/tool/list-batches/invoke", requestBody)
```
Contributor

@duwenxin99 duwenxin99 Oct 13, 2025


Can we add a more comprehensive golang test table here to make sure all the edge cases are tested? Ideally we want to test for every parameter including the pageToken. We should also be testing edge cases like empty results etc. The generic authentication feature testing should also be included. Feel free to refer to the BigQuery tests as examples. Thank you!
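For illustration, the requested table-driven shape might look like this minimal, self-contained sketch. `applyListRequest` is a hypothetical stand-in for the tool's request handling, and the filter matching is deliberately simplified; the real tool delegates filtering to the Dataproc API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyListRequest mimics, in miniature, what a list-batches handler does with
// a request body: parse pageSize and filter, then apply both to a canned list
// of batch states. Names and matching logic here are illustrative only.
func applyListRequest(all []string, body string) []string {
	var req struct {
		PageSize int    `json:"pageSize"`
		Filter   string `json:"filter"`
	}
	if err := json.Unmarshal([]byte(body), &req); err != nil {
		return nil
	}
	out := []string{}
	for _, s := range all {
		// Simplified exact-match filter; the real API supports richer syntax.
		if req.Filter == "" || req.Filter == "state = "+s {
			out = append(out, s)
		}
	}
	if req.PageSize > 0 && req.PageSize < len(out) {
		out = out[:req.PageSize]
	}
	return out
}

func main() {
	all := []string{"SUCCEEDED", "SUCCEEDED", "FAILED"}
	cases := []struct {
		name string
		body string
		want int
	}{
		{"no filter", `{}`, 3},
		{"pageSize truncates", `{"pageSize": 2}`, 2},
		{"filter failed only", `{"filter": "state = FAILED"}`, 1},
		{"empty result", `{"filter": "state = CANCELLED"}`, 0},
	}
	for _, tc := range cases {
		got := len(applyListRequest(all, tc.body))
		fmt.Printf("%s: got %d, want %d\n", tc.name, got, tc.want)
	}
}
```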

Author


Done.

To keep it compatible with arbitrary test projects, I took the approach of listing results with the underlying API directly, then comparing a possibly-paginated version of the results from the tool with the underlying results.

Note this is inherently racy, but the hope is that we won't be frequently creating new batches in the integration test project. If it turns out to be an issue, we can always filter to batches created >30m ago or something.
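The stitch-and-compare strategy described above can be sketched as follows; `collectPages` is a hypothetical helper that simulates pageSize-sized requests against an already-known listing, so the reassembled pages can be checked against the unpaginated result:

```go
package main

import (
	"fmt"
	"reflect"
)

// collectPages simulates paging through the tool with pageSize-sized requests,
// concatenating pages until the list is exhausted, so the stitched result can
// be compared against a single unpaginated listing from the underlying API.
func collectPages(all []string, pageSize int) []string {
	var out []string
	for start := 0; start < len(all); start += pageSize {
		end := start + pageSize
		if end > len(all) {
			end = len(all)
		}
		out = append(out, all[start:end]...) // one "page" per request
	}
	return out
}

func main() {
	// Stand-in for the list returned directly by the underlying Dataproc API.
	underlying := []string{"batch-1", "batch-2", "batch-3", "batch-4", "batch-5"}

	// Paginate with pageSize=2 and check the stitched result matches exactly.
	paged := collectPages(underlying, 2)
	fmt.Println("match:", reflect.DeepEqual(paged, underlying))
}
```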

@dborowitz dborowitz force-pushed the serverless-spark branch 2 times, most recently from 6976c99 to c2a6087 Compare October 14, 2025 18:09
@dborowitz dborowitz marked this pull request as draft October 14, 2025 18:22
@dborowitz
Author

I've converted to draft to ensure we don't merge before confirming the documentation/branding story with my PM. Other than that, I've resolved the one comment from @duwenxin99 about integration tests, and it's ready for review.

… tool

Built as a thin wrapper over the official Google Cloud Dataproc Go
client library, with support for filtering and pagination.
@dborowitz dborowitz marked this pull request as ready for review October 15, 2025 16:42
@dborowitz dborowitz requested a review from duwenxin99 October 15, 2025 16:42
@dborowitz
Author

Updated the docs after consulting with PM, and added docs for the prebuilt toolset, which I had missed before.

@duwenxin99
Contributor

@dborowitz Thanks for the test update! Could you add auth test cases to every tool, as BQ does here, to make sure the auth feature is working correctly? An authSource config should be added like this, and the test invocation should include a token header. Thanks!
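In sketch form, such an auth test case builds the same tool-invoke POST as before but adds an ID-token header. The header name below follows Toolbox's `<authService>_token` convention, but both it and the URL are assumptions for illustration, not the final test code:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildAuthedInvoke constructs the kind of request an auth test case would
// send: a tool-invoke POST with an ID-token header attached. The header name
// "my-google-auth_token" is an assumed example of the <authService>_token
// pattern; a real test would use a token from the configured auth service.
func buildAuthedInvoke(url, token, body string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("my-google-auth_token", token)
	return req, nil
}

func main() {
	req, err := buildAuthedInvoke(
		"http://127.0.0.1:5000/api/tool/list-batches/invoke",
		"fake-id-token", `{"pageSize": 2}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, "token set:", req.Header.Get("my-google-auth_token") != "")
}
```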
