feat(alloydb): Added generate batch embeddings sample #12721


Merged

glasnt merged 40 commits into GoogleCloudPlatform:main from twishabansal:generate_batch_embeddings

Nov 20, 2024

Contributor

twishabansal commented Oct 23, 2024 (edited)

Description

Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.

Checklist

I have followed Sample Guidelines from AUTHORING_GUIDE.MD
README is updated to include all relevant information
Tests pass: nox -s py-3.9 (see Test Environment Setup)
Lint pass: nox -s lint (see Test Environment Setup)
This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
Please merge this PR for me once it is approved


          docs: Added generate batch embeddings sample

06cdd77

review-notebook-app bot commented Oct 23, 2024

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

product-auto-label bot added samples api: notebooks labels

twishabansal added 2 commits

October 23, 2024 12:18


          Added outputs

1e24a67


          Changes to be able to run the notebook in local

7166b56

averikitsch reviewed

alloydb/notebooks/generate_batch_embeddings.ipynb


          Improved structure and readability

f91d286

twishabansal force-pushed the generate_batch_embeddings branch 2 times, most recently from fdcbe83 to 93aab38 Compare

October 24, 2024 18:05

averikitsch previously requested changes


alloydb/notebooks/batch_embeddings_update.ipynb Outdated

      "id": "D3FUBaXIUquR"
    },
    "source": [
      "This runs the complete embeddings workflow:\n",

Contributor

averikitsch Oct 24, 2024

There's a bit of a disconnect between the setup and the run. In the "Create the embeddings workflow" section, I would add some context explaining that you are setting up the functions you will use later. We also need to better prepare the user for the idea of generating embeddings for multiple columns.

Contributor Author

twishabansal Nov 8, 2024

I have added more information under the Building an Embeddings Workflow heading, and more detail in the dataset section about which columns are to be embedded.

Let me know what you think!


alloydb/notebooks/batch_embeddings_update.ipynb Outdated

      ")\n",
      "\n",
      "# Update the database with the generated embeddings concurrently\n",
      "await batch_update_rows_concurrently(\n",

Contributor

averikitsch Oct 24, 2024

As a user I would expect to run just one method to generate the embeddings, rather than having to fetch and batch the source data, generate the embeddings, and then update the database. I also have to copy the variable cols_to_embed around a lot. We might be able to simplify this devex further.

Contributor Author

twishabansal Oct 30, 2024 (edited)

I have made some changes to the code that improve the user experience: users declare the columns to embed (and other variables) once and then call run_embeddings_workflow directly.

Another alternative is to let each function use the global cols_to_embed variable and drop its argument. That could make the code harder to maintain and understand.

What do you think?

Contributor

kurtisvg Oct 30, 2024

+1 to keeping it as an argument
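
For readers following this thread, here is a minimal sketch of the shape being discussed: one entry point that batches the rows, embeds each requested column, and writes the results back, with cols_to_embed passed as an argument rather than read from a global. The names run_embeddings_workflow and cols_to_embed come from the thread; the signature, the helper functions, the `<col>_embedding` key naming, and the fake embedder are assumptions made for illustration and are not taken from the notebook.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

Row = dict[str, object]
# Hypothetical embedder type: takes a list of texts, returns one vector per text.
Embedder = Callable[[list[str]], Awaitable[list[list[float]]]]
# Hypothetical writer type: persists a list of per-row updates.
Writer = Callable[[list[Row]], Awaitable[None]]


def batch_rows(rows: Iterable[Row], batch_size: int) -> list[list[Row]]:
    """Split rows into lists of at most batch_size rows."""
    rows = list(rows)
    return [rows[i : i + batch_size] for i in range(0, len(rows), batch_size)]


async def embed_batch(
    batch: list[Row], cols_to_embed: list[str], embed_texts: Embedder
) -> list[Row]:
    """Return one update dict per row, keyed by id plus '<col>_embedding'."""
    updates: list[Row] = [{"id": row["id"]} for row in batch]
    for col in cols_to_embed:
        # Skip empty or missing values instead of sending them to the model.
        present = [(i, str(row[col])) for i, row in enumerate(batch) if row.get(col)]
        if not present:
            continue
        vectors = await embed_texts([text for _, text in present])
        for (i, _), vector in zip(present, vectors):
            updates[i][f"{col}_embedding"] = vector
    return updates


async def run_embeddings_workflow(
    rows: Iterable[Row],
    cols_to_embed: list[str],
    embed_texts: Embedder,
    write_updates: Writer,
    batch_size: int = 2,  # illustrative default, not a documented service limit
) -> int:
    """Batch, embed, and write back in one call; returns the number of rows written."""
    batches = batch_rows(rows, batch_size)
    embedded = await asyncio.gather(
        *(embed_batch(b, cols_to_embed, embed_texts) for b in batches)
    )
    await asyncio.gather(*(write_updates(u) for u in embedded))
    return sum(len(u) for u in embedded)


async def fake_embed_texts(texts: list[str]) -> list[list[float]]:
    """Stand-in for a real embedding model: a one-dimensional 'vector' per text."""
    return [[float(len(t))] for t in texts]
```

With this shape the caller names the columns once, at the single call site, and every helper receives them explicitly, which keeps the functions testable in isolation.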


          Back to old commit

4cc34e4

twishabansal force-pushed the generate_batch_embeddings branch from 93aab38 to 4cc34e4 Compare

October 25, 2024 05:22

twishabansal and others added 10 commits

October 25, 2024 10:54


          Back to working code

f912ec3


          Resolved comments

169e7d2


          Added indentation

1a92d0b


          Merge branch 'GoogleCloudPlatform:main' into generate_batch_embeddings

baa4f7f


          code cleanup

d8cbb33


          Limit the batch size for text embeddings

951cb04


          fixed: any empty cols to embed, max instances per prediction

6142e9c


          Moved connector above

888a800


          lint

edca943


          cleanup

37d06a2

kurtisvg suggested changes

alloydb/notebooks/batch_embeddings_update.ipynb


          Merge branch 'main' into generate_batch_embeddings

90b38a4

twishabansal marked this pull request as ready for review

November 6, 2024 16:53

twishabansal requested review from a team as code owners

November 6, 2024 16:53

blunderbuss-gcf bot assigned engelke

glasnt assigned glasnt and unassigned engelke

glasnt added the kokoro:force-run label

kokoro-team removed the kokoro:force-run label

glasnt reviewed

alloydb/notebooks/embeddings_batch_processing_e2e_test.py

glasnt and others added 3 commits

November 15, 2024 07:56


          Update alloydb/notebooks/embeddings_batch_processing_e2e_test.py

0ee945e


          fix lint errors

38da55e


          fix import order

168473b

glasnt added the kokoro:force-run label

kokoro-team removed the kokoro:force-run label

glasnt reviewed

alloydb/notebooks/embeddings_batch_processing_e2e_test.py


          Update alloydb/notebooks/embeddings_batch_processing_e2e_test.py

e08caa6

Contributor

glasnt commented Nov 20, 2024 (edited)

You will need to add a copy of noxfile_config.py from the root folder and adjust it to specify which Python versions you want to test. (By default, every version except Python 3.8 and Python 3.12 is ignored.)
I've been able to verify that, as the owner of a Google Cloud project, I can run the notebook as is after importing it into Colab Enterprise.
I've also been able to confirm I get the same or similar errors when running pytest locally.

~~I'm still working out how the testing harness works, because I can add broken commands to cells, and not stop the process earlier.~~
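
For context, a per-directory noxfile_config.py in this repository is a small Python module exposing a TEST_CONFIG_OVERRIDE dictionary. The sketch below shows the general shape only. The list of ignored versions is an assumption chosen for illustration (it leaves Python 3.10 as the one tested version) and is not the configuration that was merged in this PR.

```python
# noxfile_config.py (sketch; the values below are illustrative assumptions)

TEST_CONFIG_OVERRIDE = {
    # Python versions the test harness should skip for this sample directory.
    # Assumption for illustration: everything except 3.10 is skipped.
    "ignored_versions": ["2.7", "3.7", "3.8", "3.9", "3.11", "3.12"],
    # Extra environment variables to inject into the test session (none here).
    "envs": {},
}
```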

Contributor

glasnt commented Nov 20, 2024

Configured production table, let's try to /gcbrun this again.

glasnt added the kokoro:force-run label

kokoro-team removed the kokoro:force-run label


          fix: ignore Python 3.8 (pandas deps issue)

ebaa427

glasnt approved these changes


Contributor

glasnt left a comment

Pending confirmation of sample use case (offline)

glasnt reviewed


alloydb/notebooks/embeddings_batch_processing_e2e_test.py Outdated

Comment on lines 15 to 17

+              # This sample creates a secure two-service application running on Cloud Run.
+              # This test builds and deploys the two secure services
+              # to test that they interact properly together.

Contributor

glasnt Nov 20, 2024

Suggest removing this bitrot comment and adding a note about data persistence.

Suggested change

-              # This sample creates a secure two-service application running on Cloud Run.
-              # This test builds and deploys the two secure services
-              # to test that they interact properly together.
+              # Maintainer Note: this sample presumes data exists in ALLOYDB_TABLE_NAME within the ALLOYDB_(cluster/instance/database)

Contributor

glasnt Nov 20, 2024

This file might also be better named e2e_test.py, for simplicity (any _test.py file will be picked up by pytest; the exact name doesn't matter).
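
To illustrate the point about naming: pytest's default file discovery collects files matching `test_*.py` or `*_test.py`. The sketch below mimics that matching with the standard library; is_collected_by_default is a hypothetical helper written for this example and is not part of pytest.

```python
from fnmatch import fnmatchcase

# pytest's default `python_files` patterns.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def is_collected_by_default(filename: str) -> bool:
    """Return True if the file name matches one of pytest's default patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in DEFAULT_PATTERNS)
```

Both the original name, embeddings_batch_processing_e2e_test.py, and the shorter e2e_test.py match the second pattern, so the rename does not change what pytest collects.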

glasnt assigned glasnt and unassigned glasnt

kurtisvg approved these changes


Contributor

kurtisvg left a comment

LGTM, with the suggestion that we might want to test on Python 3.10 (currently the version Colab runs).

glasnt reviewed

alloydb/notebooks/noxfile_config.py


          update tested python versions

aa253d6

kurtisvg approved these changes



          Rename e2e_file.py, update header commentary

371825b

glasnt reviewed

alloydb/notebooks/e2e_test.py


          Update alloydb/notebooks/e2e_test.py

19e2039

glasnt merged commit 08e0146 into GoogleCloudPlatform:main

11 checks passed

glasnt mentioned this pull request

Add notebook to quickly generate and load embeddings in AlloyDB. #12588

Closed

6 tasks


Labels

api: notebooks samples