diff --git a/alloydb/notebooks/batch_embeddings_update.ipynb b/alloydb/notebooks/batch_embeddings_update.ipynb new file mode 100644 index 000000000000..4c836eb88b17 --- /dev/null +++ b/alloydb/notebooks/batch_embeddings_update.ipynb @@ -0,0 +1,1122 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "upi2EY4L9ei3" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mbF2F2miAT4a" + }, + "source": [ + "# Batch Embeddings Update\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/alloydb/notebooks/batch_embeddings_update.ipynb)\n", + "\n", + "---\n", + "## Introduction\n", + "\n", + "This notebook demonstrates an efficient way to generate and store vector embeddings in AlloyDB. 
You'll learn how to:\n", + "\n", + "* **Optimize embedding generation**: Dynamically batch text chunks based on character length to generate more embeddings with each API call.\n", + "* **Streamline storage**: Use [Asyncio](https://docs.python.org/3/library/asyncio.html) to seamlessly update AlloyDB with the generated embeddings.\n", + "\n", + "This approach significantly speeds up the process, especially for large datasets, making it ideal for efficiently handling large-scale embedding tasks." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FbcZUjT1yvTq" + }, + "source": [ + "## What you'll need\n", + "\n", + "* A Google Cloud Account and Google Cloud Project" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vHdR4fF3vLWA" + }, + "source": [ + "## Objectives\n", + "\n", + "In the following instructions you will learn to:\n", + "\n", + "1. Install required dependencies for our application\n", + "2. Set up authentication for our project\n", + "3. Set up an AlloyDB for PostgreSQL instance\n", + "4. Import the data used by our application" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uy9KqgPQ4GBi" + }, + "source": [ + "## Basic Setup\n", + "### Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M_ppDxYf4Gqs" + }, + "outputs": [], + "source": [ + "%pip install google-cloud-alloydb-connector[asyncpg]==1.4.0 sqlalchemy==2.0.36 pandas==2.2.3 vertexai==1.70.0 asyncio==3.4.3 greenlet==3.1.1 --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Authenticate to Google Cloud within Colab\n", + "If you're running this in a Google Colab notebook, you will need to authenticate as an IAM user."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# from google.colab import auth\n", + "\n", + "# auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UCiNGP1Qxd6x" + }, + "source": [ + "### Connect Your Google Cloud Project" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SLUGlG6UE2CK", + "outputId": "a284c046-00df-414a-9039-ddc5df12536d" + }, + "outputs": [], + "source": [ + "# @markdown Please fill in the value below with your GCP project ID and then run the cell.\n", + "\n", + "# Please fill in these values.\n", + "project_id = \"my-project-id\" # @param {type:\"string\"}\n", + "\n", + "# Quick input validations.\n", + "assert project_id, \"⚠️ Please provide a Google Cloud project ID\"\n", + "\n", + "# Configure gcloud.\n", + "!gcloud config set project {project_id}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O-oqMC5Ox-ZM" + }, + "source": [ + "### Enable APIs for AlloyDB and Vertex AI" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X-bzfFb4A-xK" + }, + "source": [ + "You will need to enable these APIs in order to create an AlloyDB database and utilize Vertex AI as an embeddings service!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CKWrwyfzyTwH", + "outputId": "f5131e77-2750-4cb1-b153-c52a13aaf284" + }, + "outputs": [], + "source": [ + "!gcloud services enable alloydb.googleapis.com aiplatform.googleapis.com" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gn8g7-wCyZU6" + }, + "source": [ + "## Set up AlloyDB\n", + "You will need a Postgres AlloyDB instance for the following stages of this notebook. Please set the following variables." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8q2lc-Po1mPv", + "outputId": "e268aea8-0514-4308-f5c7-1916031255b7" + }, + "outputs": [], + "source": [ + "# @markdown Please fill in both the Google Cloud region and the names of your AlloyDB cluster, instance, and database. Once filled in, run the cell.\n", + "\n", + "# Please fill in these values.\n", + "region = \"us-central1\" # @param {type:\"string\"}\n", + "cluster_name = \"my-cluster\" # @param {type:\"string\"}\n", + "instance_name = \"my-primary\" # @param {type:\"string\"}\n", + "database_name = \"test_db\" # @param {type:\"string\"}\n", + "table_name = \"investments\"\n", + "password = input(\"Please provide a password to be used for 'postgres' database user: \")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "XXI1uUu3y8gc" + }, + "outputs": [], + "source": [ + "# Quick input validations.\n", + "assert region, \"⚠️ Please provide a Google Cloud region\"\n", + "assert instance_name, \"⚠️ Please provide the name of your instance\"\n", + "assert database_name, \"⚠️ Please provide the name of your database\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T616pEOUygYQ" + }, + "source": [ + "### Create an AlloyDB Instance\n", + "If you have already created an AlloyDB cluster and instance, you can skip these steps and go to the Create a Database section." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xyZYX4Jo1vfh" + }, + "source": [ + "> ⏳ - Creating an AlloyDB cluster may take a few minutes."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MQYni0NlTLzC", + "outputId": "118d9a2b-2d9d-44ae-a33f-fb89ed6a2895" + }, + "outputs": [], + "source": [ + "# create the AlloyDB Cluster\n", + "!gcloud beta alloydb clusters create {cluster_name} --password={password} --region={region}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o8LkscYH5Vfp" + }, + "source": [ + "Create an instance attached to our cluster with the following command.\n", + "> ⏳ - Creating an AlloyDB instance may take a few minutes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "TkqQSWoY5Kab", + "outputId": "78e02d10-5e14-457a-86c6-21348898bd0a" + }, + "outputs": [], + "source": [ + "!gcloud beta alloydb instances create {instance_name} --instance-type=PRIMARY --cpu-count=2 --region={region} --cluster={cluster_name}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BXsQ1UJv4ZVJ" + }, + "source": [ + "To connect to your AlloyDB instance from this notebook, you will need to enable public IP on your instance. Alternatively, you can follow [these instructions](https://cloud.google.com/alloydb/docs/connect-external) to connect to an AlloyDB for PostgreSQL instance with Private IP from outside your VPC." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "OPVWsQB04Yyl", + "outputId": "79f213ac-a069-4b15-e949-189f166dfca1" + }, + "outputs": [], + "source": [ + "!gcloud beta alloydb instances update {instance_name} --region={region} --cluster={cluster_name} --assign-inbound-public-ip=ASSIGN_IPV4 --database-flags=\"password.enforce_complexity=on\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UabC_qh5HOVy" + }, + "source": [ + "Please wait for the instance to be updated. This might take some time. You can check whether the changes have been applied using:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_KC91mQZHABv", + "outputId": "6da8a6e4-549b-428d-a488-8dc993ddd216" + }, + "outputs": [], + "source": [ + "!gcloud beta alloydb instances describe {instance_name} --region={region} --cluster={cluster_name}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_K86id-dcjcm" + }, + "source": [ + "### Connect to AlloyDB\n", + "\n", + "This function will create a connection pool to your AlloyDB instance using the [AlloyDB Python connector](https://github.com/GoogleCloudPlatform/alloydb-python-connector). The AlloyDB Python connector will automatically create secure connections to your AlloyDB instance using mTLS."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "fYKVQzv2cjcm" + }, + "outputs": [], + "source": [ + "import asyncpg\n", + "\n", + "import sqlalchemy\n", + "from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine\n", + "\n", + "from google.cloud.alloydb.connector import AsyncConnector, IPTypes\n", + "\n", + "async def init_connection_pool(connector: AsyncConnector, db_name: str, pool_size: int = 5) -> AsyncEngine:\n", + " # build the instance URI used for connections to AlloyDB\n", + " connection_string = f\"projects/{project_id}/locations/{region}/clusters/{cluster_name}/instances/{instance_name}\"\n", + "\n", + " async def getconn() -> asyncpg.Connection:\n", + " conn: asyncpg.Connection = await connector.connect(\n", + " connection_string,\n", + " \"asyncpg\",\n", + " user=\"postgres\",\n", + " password=password,\n", + " db=db_name,\n", + " ip_type=IPTypes.PUBLIC,\n", + " )\n", + " return conn\n", + "\n", + " pool = create_async_engine(\n", + " \"postgresql+asyncpg://\",\n", + " async_creator=getconn,\n", + " pool_size=pool_size,\n", + " max_overflow=0,\n", + " )\n", + " return pool" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i_yNN1MnJpTR" + }, + "source": [ + "### Create a Database\n", + "\n", + "Next, you will create a database to store the data using the connection pool. Enabling public IP takes a few minutes, so you may get an error that there is no public IP address. Please wait and retry this step if you hit an error!"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7PX05ndo_AMc", + "outputId": "0931754a-aeb8-4895-e0b5-eeb01ffe5506" + }, + "outputs": [], + "source": [ + "from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine\n", + "from sqlalchemy import text, exc\n", + "\n", + "from google.cloud.alloydb.connector import AsyncConnector, IPTypes\n", + "\n", + "async def create_db(database_name):\n", + " # Connect to the default 'postgres' database through a connection pool\n", + " connector = AsyncConnector()\n", + " connection_string = f\"projects/{project_id}/locations/{region}/clusters/{cluster_name}/instances/{instance_name}\"\n", + " pool = await init_connection_pool(connector, \"postgres\")\n", + " async with pool.connect() as conn:\n", + " try:\n", + " await conn.execute(text(\"COMMIT\")) # end transaction\n", + " await conn.execute(text(f\"CREATE DATABASE {database_name}\"))\n", + " print(f\"Database '{database_name}' created successfully\")\n", + " except exc.ProgrammingError:\n", + " print(f\"Database '{database_name}' already exists\")\n", + "\n", + "await create_db(database_name=database_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HdolCWyatZmG" + }, + "source": [ + "### Download data\n", + "\n", + "The following code has been prepared to help insert the CSV data into your AlloyDB for PostgreSQL database."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dzr-2VZIkvtY" + }, + "source": [ + "Download the CSV file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5KkIQ2zSvQkN", + "outputId": "f1980d73-4171-4fb1-b912-164187ba283b" + }, + "outputs": [], + "source": [ + "!gcloud storage cp gs://cloud-samples-data/alloydb/investments_data ./investments.csv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oFU13dCBlYHh" + }, + "source": [ + "The download can be verified by the following command or using the \"Files\" tab." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nQBs10I8vShh", + "outputId": "e81e933b-819d-46ac-f4de-6a1f943faa48" + }, + "outputs": [], + "source": [ + "!ls" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2H7rorG9Ivur" + }, + "source": [ + "In this next step you will:\n", + "\n", + "1. Create the table to store the data\n", + "2. 
Insert the data from the CSV into the database table" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r16wPmxOBn_r" + }, + "source": [ + "### Import data to your database\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "v1pi9-8tB_pH" + }, + "outputs": [], + "source": [ + "# Prepare data\n", + "import pandas as pd\n", + "\n", + "data = \"./investments.csv\"\n", + "\n", + "df = pd.read_csv(data)\n", + "df['etf'] = df['etf'].map({'t': True, 'f': False})\n", + "df['rating'] = df['rating'].fillna('').astype(str)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 345 + }, + "id": "4R6tzuUtLypO", + "outputId": "270d5fcd-b62d-4e3c-8c4e-25428798a350" + }, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UstTWGJyL7j-" + }, + "source": [ + "The data consists of the following columns:\n", + "\n", + "* **id**\n", + "* **ticker**: A string representing the stock symbol or ticker (e.g., \"AAPL\" for Apple, \"GOOG\" for Google).\n", + "* **etf**: A boolean value indicating whether the asset is an ETF (True) or not (False).\n", + "* **market**: A string representing the stock exchange where the asset is traded.\n", + "* **rating**: Whether to hold, buy or sell a stock.\n", + "* **overview**: A text field for a general overview or description of the asset.\n", + "* **analysis**: A text field, for a more detailed analysis of the asset.\n", + "* **overview_embedding** (empty)\n", + "* **analysis_embedding** (empty)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "KqpLkwbWCJaw" + }, + "outputs": [], + "source": [ + "create_table_cmd = sqlalchemy.text(\n", + " f'CREATE TABLE {table_name} ( \\\n", + " id SERIAL PRIMARY KEY, \\\n", + " ticker VARCHAR(255) NOT NULL UNIQUE, \\\n", + " etf BOOLEAN, \\\n", + " market VARCHAR(255), \\\n", + " 
rating TEXT, \\\n", + " overview TEXT, \\\n", + " overview_embedding VECTOR (768), \\\n", + " analysis TEXT, \\\n", + " analysis_embedding VECTOR (768) \\\n", + " )'\n", + ")\n", + "\n", + "\n", + "insert_data_cmd = sqlalchemy.text(\n", + " f\"\"\"\n", + " INSERT INTO {table_name} (id, ticker, etf, market,\n", + " rating, overview, analysis) VALUES (:id, :ticker, :etf, :market,\n", + " :rating, :overview, :analysis)\n", + " \"\"\"\n", + ")\n", + "\n", + "parameter_map = [\n", + " {\n", + " \"id\": row[\"id\"],\n", + " \"ticker\": row[\"ticker\"],\n", + " \"etf\": row[\"etf\"],\n", + " \"market\": row[\"market\"],\n", + " \"rating\": row[\"rating\"],\n", + " \"overview\": row[\"overview\"],\n", + " \"analysis\": row[\"analysis\"],\n", + " }\n", + " for index, row in df.iterrows()\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "qCsM2KXbdYiv" + }, + "outputs": [], + "source": [ + "from google.cloud.alloydb.connector import AsyncConnector\n", + "\n", + "connector = AsyncConnector()\n", + "\n", + "# Create table and insert data\n", + "async def insert_data(pool):\n", + " async with pool.connect() as db_conn:\n", + " await db_conn.execute(sqlalchemy.text(\"CREATE EXTENSION IF NOT EXISTS vector;\"))\n", + " await db_conn.execute(create_table_cmd)\n", + " await db_conn.execute(\n", + " insert_data_cmd,\n", + " parameter_map,\n", + " )\n", + " await db_conn.commit()\n", + "\n", + "pool = await init_connection_pool(connector, database_name)\n", + "await insert_data(pool)\n", + "await pool.dispose()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IaC8uhlfEwam" + }, + "source": [ + "## Create the embeddings workflow\n", + "\n", + "The embeddings workflow contains four major parts:\n", + "1. Read the data\n", + "2. Batch the data\n", + "3. Generate embeddings\n", + "4. 
Update original table\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oIk5GxbnFaE3" + }, + "source": [ + "#### Step 0: Configure Logging" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "wvYGGRRoFXl4" + }, + "outputs": [], + "source": [ + "import logging\n", + "import sys\n", + "\n", + "# Configure the root logger to output messages with INFO level or above\n", + "logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='%(asctime)s[%(levelname)5s][%(name)14s] - %(message)s', datefmt='%H:%M:%S', force=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ekrEM22pJ2df" + }, + "source": [ + "#### Step 1: Read the data\n", + "\n", + "This code reads data from a database and yields it for further processing." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "IZgMik9XBW19" + }, + "outputs": [], + "source": [ + "from typing import AsyncIterator, List\n", + "from sqlalchemy import RowMapping\n", + "from sqlalchemy.ext.asyncio import AsyncEngine\n", + "\n", + "async def get_source_data(pool: AsyncEngine, embed_cols: List[str]) -> AsyncIterator[RowMapping]:\n", + " \"\"\"\n", + " Yields data in the form of:\n", + " {'id' : 'id1', 'col1': 'val1', 'col2': 'val2'}\n", + " where col1 and col2 are columns containing data to be embedded.\n", + " \"\"\"\n", + " logger = logging.getLogger('get_source_data')\n", + "\n", + " sql = f\"SELECT id, {', '.join(embed_cols)} FROM {table_name}\"\n", + " logger.info(f\"Running SQL query: {sql}\")\n", + " async with pool.connect() as conn:\n", + " async for row in await conn.stream(text(sql)):\n", + " logger.debug(f\"yielded row: {row._mapping['id']}\")\n", + " # Yield the row as a dictionary (RowMapping)\n", + " yield row._mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kg54pvhjJ5kL" + }, + "source": [ + "#### Step 2: Batch the data\n", + "\n", + "This code defines a function called 
`batch_source_data` that takes database rows and groups them into batches based on a character count limit (max_char_count). This batching process is crucial for efficient embedding generation for these reasons:\n", + "\n", + "* **Resource Optimization:** Instead of sending numerous small requests, batching allows us to send fewer, larger requests. This significantly optimizes resource usage and potentially reduces API costs.\n", + "\n", + "* **Working Within API Limits:** The max_char_count limit ensures each batch stays within the API's acceptable input size, preventing issues with exceeding the maximum character limit.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "76qq6G38CZfm" + }, + "outputs": [], + "source": [ + "from typing import Any, List\n", + "import asyncio\n", + "\n", + "async def batch_source_data(read_generator: AsyncIterator[RowMapping], embed_cols: List[str]) -> AsyncIterator[List[dict[str, Any]]]:\n", + " \"\"\"\n", + " Yields data in the form of:\n", + " [\n", + " {'id' : 'id1', 'col1': 'val1', 'col2': 'val2'},\n", + " ...\n", + " ]\n", + " where col1 and col2 are columns containing data to be embedded.\n", + " \"\"\"\n", + " logger = logging.getLogger('batch_data')\n", + "\n", + " global max_char_count\n", + "\n", + " batch = []\n", + " char_count = 0\n", + " batch_num = 0\n", + "\n", + " async for row in read_generator:\n", + " # Char count in current row\n", + " row_char_count = sum(len(row[col]) for col in embed_cols)\n", + "\n", + " if batch and char_count + row_char_count > max_char_count:\n", + " batch_num += 1\n", + " logger.info(f\"Yielded batch number: {batch_num} with length: {len(batch)}\")\n", + " yield batch\n", + " batch, char_count = [], 0\n", + "\n", + " # Add the current row to the batch\n", + " batch.append(row)\n", + " char_count += row_char_count\n", + "\n", + " if batch:\n", + " batch_num += 1\n", + " logger.info(f\"Yielded batch number: {batch_num} with length: {len(batch)}\")\n", + " yield 
batch" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_L4EnrleJ8gy" + }, + "source": [ + "#### Step 3: Generate embeddings\n", + "\n", + "This step converts your text data into numerical representations called \"embeddings.\" These embeddings capture the meaning and relationships between words, making them useful for various tasks like search, recommendations, and clustering.\n", + "\n", + "The code uses two functions to efficiently generate embeddings:\n", + "\n", + "**embed_text**\n", + "\n", + "This function takes your text data and sends it to Vertex AI, transforming the text in specific columns into embeddings.\n", + "\n", + "**embed_objects_concurrently**\n", + "\n", + "This function is the orchestrator. It manages the embedding generation process for multiple batches of text concurrently, limiting the number of simultaneous requests so that all batches are processed without overwhelming the system." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "4OYdrJk9Co0v" + }, + "outputs": [], + "source": [ + "from google.api_core.exceptions import ResourceExhausted\n", + "from typing import Union\n", + "from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel\n", + "\n", + "async def embed_text(\n", + " logger: logging.Logger,\n", + " batch_data: List[dict[str, Any]],\n", + " model: TextEmbeddingModel,\n", + " cols_to_embed: List[str],\n", + " task_type: str = \"SEMANTIC_SIMILARITY\",\n", + " retries: int = 100,\n", + " delay: int = 1,\n", + ") -> List[dict[str, Union[List[float], str]]]:\n", + " \"\"\"\n", + " Returns data in the form of:\n", + " [\n", + " {\n", + " 'id': 'id1',\n", + " 'col1_embedding': [1.0, 1.1, ...],\n", + " 'col2_embedding': [2.0, 2.1, ...],\n", + " ...\n", + " },\n", + " ...\n", + " ]\n", + " where col1 and col2 are columns containing data to be embedded.\n", + " \"\"\"\n", + " global total_char_count\n", + "\n", + " # Place all of the embeddings into a single list\n", + " inputs = []\n", + 
" for data in batch_data:\n", + " inputs.extend(\n", + " TextEmbeddingInput(data[col], task_type) for col in cols_to_embed\n", + " )\n", + "\n", + " for attempt in range(retries): # Retry loop\n", + " try:\n", + " # Get embeddings for the text data\n", + " embeddings = await model.get_embeddings_async(inputs)\n", + "\n", + " # Increase total char count\n", + " total_char_count += sum([len(input.text) for input in inputs])\n", + "\n", + " # group the results together by id\n", + " embedding_iter = iter(embeddings)\n", + " results = []\n", + " for row in batch_data:\n", + " r = { 'id': row['id'] }\n", + " for col in cols_to_embed:\n", + " r[f'{col}_embedding'] = str(next(embedding_iter).values)\n", + " results.append(r)\n", + " return results\n", + "\n", + " except ResourceExhausted as e:\n", + " if attempt < retries - 1: # Retry only if attempts are left\n", + " logger.warning(f\"Error: {e}. Retrying in {delay} seconds...\")\n", + " await asyncio.sleep(delay) # Wait before retrying\n", + " else:\n", + " logger.error(f\"Failed to get embeddings after {retries} attempts.\")\n", + " raise # Raise the error if all retries fail\n", + "\n", + " return []\n", + "\n", + "async def embed_objects_concurrently(\n", + " cols_to_embed: List[str],\n", + " batch_data: AsyncIterator[List[dict[str, Any]]],\n", + " model: TextEmbeddingModel,\n", + " task_type: str,\n", + " max_concurrency: int = 5,\n", + ") -> AsyncIterator[List[dict[str, Union[str, List[float]]]]]:\n", + " \"\"\"\n", + " Embeds text from objects concurrently with a maximum concurrency limit. 
This\n", + " function processes batches of data concurrently, limiting the number of\n", + " simultaneous embedding tasks to improve efficiency and resource utilization.\n", + " \"\"\"\n", + " logger = logging.getLogger('embed_objects')\n", + " # Keep track of pending tasks\n", + " pending: set[asyncio.Task] = set()\n", + " has_next = True\n", + " while pending or has_next:\n", + " while len(pending) < max_concurrency and has_next:\n", + " try:\n", + " data = await batch_data.__anext__() \n", + " coro = embed_text(logger, data, model, cols_to_embed, task_type)\n", + " pending.add(asyncio.ensure_future(coro))\n", + " except StopAsyncIteration:\n", + " has_next = False\n", + "\n", + " done, pending = await asyncio.wait(\n", + " pending, return_when=asyncio.FIRST_COMPLETED\n", + " )\n", + "\n", + " for task in done:\n", + " result = task.result()\n", + " logger.info(f\"Embedding task completed: Processed {len(result)} rows.\")\n", + " yield result\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FjErJPrJKA2j" + }, + "source": [ + "#### Step 4: Update original table\n", + "\n", + "After generating embeddings for your text data, you need to store them in your database. This step efficiently updates your original table with the newly created embeddings.\n", + "\n", + "This process uses two functions to manage database updates:\n", + "\n", + "**batch_update_rows**\n", + "1. This function takes a batch of data (including the embeddings) and updates the corresponding rows in your database table.\n", + "2. It constructs an SQL UPDATE query to modify specific columns with the embedding values.\n", + "3. It ensures that the updates are done efficiently and correctly within a database transaction.\n", + "\n", + "\n", + "**batch_update_rows_concurrently**\n", + "\n", + "1. This function handles the concurrent updating of multiple batches of data.\n", + "2. 
It creates multiple \"tasks\" that each execute the batch_update_rows function on a separate batch.\n", + "3. It limits the number of concurrent tasks to avoid overloading your database and system resources.\n", + "4. It manages the execution of these tasks, ensuring that all batches are processed efficiently." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "id": "lEyvhlOCCr7F" + }, + "outputs": [], + "source": [ + "from sqlalchemy import text\n", + "\n", + "async def batch_update_rows(pool: AsyncEngine, logger: logging.Logger, data: List[dict[str, Any]], cols_to_embed: List[str]) -> None:\n", + " update_query = f\"\"\"\n", + " UPDATE {table_name}\n", + " SET {', '.join([f'{col}_embedding = :{col}_embedding' for col in cols_to_embed])}\n", + " WHERE id = :id;\n", + " \"\"\"\n", + "\n", + " async with pool.connect() as conn:\n", + " await conn.execute(\n", + " text(update_query),\n", + " # Create parameters for all rows in the data\n", + " parameters = data,\n", + " )\n", + " await conn.commit()\n", + " logger.info(f\"Updated {len(data)} rows in database.\")\n", + "\n", + "\n", + "async def batch_update_rows_concurrently(\n", + " pool: AsyncEngine,\n", + " embed_data: AsyncIterator[List[dict[str, Any]]],\n", + " cols_to_embed: List[str],\n", + " max_concurrency: int = 5\n", + "):\n", + " logger = logging.getLogger('update_rows')\n", + " # Keep track of pending tasks\n", + " pending: set[asyncio.Task] = set()\n", + " has_next = True\n", + " while pending or has_next:\n", + " while len(pending) < max_concurrency and has_next:\n", + " try:\n", + " data = await embed_data.__anext__() \n", + " coro = batch_update_rows(pool, logger, data, cols_to_embed)\n", + " pending.add(asyncio.ensure_future(coro))\n", + " except StopAsyncIteration:\n", + " has_next = False\n", + "\n", + " done, pending = await asyncio.wait(\n", + " pending, return_when=asyncio.FIRST_COMPLETED\n", + " )\n", + "\n", + " logger.info(\"All database update tasks 
completed.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HSv4DwzbJc5J" + }, + "source": [ + "## Run the embeddings workflow\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "rWb1T9aIBWa-" + }, + "outputs": [], + "source": [ + "# Max token count for the embeddings API\n", + "max_tokens = 20000\n", + "\n", + "# For some tokenizers and text, there's a rough approximation that 1 token corresponds to about 3-4 characters. This is a very general guideline and can vary significantly.\n", + "max_char_count = max_tokens * 3\n", + "\n", + "cols_to_embed = ['analysis','overview']\n", + "\n", + "# Model to use for generating embeddings\n", + "model_name = 'text-embedding-004'\n", + "\n", + "# Generate optimised embeddings for a given task\n", + "# Ref: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types#supported_task_types\n", + "task = \"SEMANTIC_SIMILARITY\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D3FUBaXIUquR" + }, + "source": [ + "This runs the complete embeddings workflow:\n", + "\n", + "1. Getting source data\n", + "2. Batching source data\n", + "3. Generating embeddings for batches\n", + "4. 
Updating data batches in the original table" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "syO1Zq3o5PnI", + "outputId": "8db5edfc-7b9e-46da-bda8-123444033b37" + }, + "outputs": [], + "source": [ + "import vertexai\n", + "import time\n", + "import asyncio\n", + "from vertexai.language_models import TextEmbeddingModel\n", + "\n", + "pool_size = 10\n", + "embed_data_concurrency = 20\n", + "batch_update_concurrency = 10\n", + "total_char_count = 0\n", + "\n", + "# Set up connections to the database\n", + "connector = AsyncConnector()\n", + "pool = await init_connection_pool(connector, database_name, pool_size=pool_size)\n", + "\n", + "# Initialise VertexAI and the model to be used to generate embeddings\n", + "vertexai.init(project=project_id, location=region)\n", + "model = TextEmbeddingModel.from_pretrained(model_name)\n", + "\n", + "start_time = time.time() # wall-clock time, so time.ctime() below reports real timestamps\n", + "\n", + "# Fetch source data from the database\n", + "source_data = get_source_data(pool, cols_to_embed)\n", + "\n", + "# Divide the source data into batches for efficient processing\n", + "batch_data = batch_source_data(source_data, cols_to_embed)\n", + "\n", + "# Generate embeddings for the batched data concurrently\n", + "embeddings_data = embed_objects_concurrently(cols_to_embed, batch_data, model, task, max_concurrency=embed_data_concurrency)\n", + "\n", + "# Update the database with the generated embeddings concurrently\n", + "await batch_update_rows_concurrently(pool, embeddings_data, cols_to_embed, max_concurrency=batch_update_concurrency)\n", + "\n", + "end_time = time.time()\n", + "elapsed_time = end_time - start_time\n", + "\n", + "# Release database connections and close the connector\n", + "await pool.dispose()\n", + "await connector.close()\n", + "\n", + "print(f\"Job started at: {time.ctime(start_time)}\")\n", + "print(f\"Job ended at: {time.ctime(end_time)}\")\n", + "print(f\"Total 
run time: {elapsed_time:.2f} seconds\")\n", + "print(f\"Total characters embedded: {total_char_count}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "id": "fzZJsWRZAMxs" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "python-docs-samples", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/alloydb/notebooks/generate_batch_embeddings.ipynb b/alloydb/notebooks/generate_batch_embeddings.ipynb deleted file mode 100644 index 33a75471c9c8..000000000000 --- a/alloydb/notebooks/generate_batch_embeddings.ipynb +++ /dev/null @@ -1,2589 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "id": "upi2EY4L9ei3" - }, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mbF2F2miAT4a" - }, - "source": [ - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/alloydb/notebooks/generate_batch_embeddings.ipynb)\n", - "\n", - "---\n", - "# Introduction\n", - "\n", - "This notebook shows you how to batch generate vector embeddings and store them in an AlloyDB database.\n", - "\n", - "With the steps listed here, you can dynamically build a batch of text chunks to embed based on character length of the source data in order to get more results per inference, leading to much more efficient embeddings generation. The process uses Asyncio to efficiently load the embeddings into AlloyDB after they are generated. These techniques can significantly speed up the process of generating large batches of embeddings and storing them in AlloyDB." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FbcZUjT1yvTq" - }, - "source": [ - "## What you'll need\n", - "\n", - "* A Google Cloud Account and Google Cloud Project" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vHdR4fF3vLWA" - }, - "source": [ - "## Setup and Requirements\n", - "\n", - "In the following instructions you will learn to:\n", - "\n", - "1. Install required dependencies for our application\n", - "2. Set up authentication for our project\n", - "3. Set up a AlloyDB for PostgreSQL Instance\n", - "4. 
Import the data used by our application" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uy9KqgPQ4GBi" - }, - "source": [ - "### Install dependencies" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "M_ppDxYf4Gqs" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mWARNING: google-cloud-aiplatform 1.70.0 does not provide the extra 'all'\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: You are using pip version 22.0.4; however, version 24.2 is available.\n", - "You should consider upgrading via the '/Users/twishabansal/Documents/forks/python-docs-samples/bin/python -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n", - "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n" - ] - } - ], - "source": [ - "%pip install google-cloud-alloydb-connector[asyncpg]==1.4.0 sqlalchemy==2.0.36 pandas==2.2.3 vertexai==1.70.0 asyncio==3.4.3 greenlet==3.1.1 --quiet" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Authenticate to Google Cloud within Colab\n", - "If you're running this on google colab notebook, you will need to Authenticate as an IAM user." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "# from google.colab import auth\n", - "\n", - "# auth.authenticate_user()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UCiNGP1Qxd6x" - }, - "source": [ - "### Connect Your Google Cloud Project" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "SLUGlG6UE2CK", - "outputId": "a284c046-00df-414a-9039-ddc5df12536d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Updated property [core/project].\n" - ] - } - ], - "source": [ - "# @markdown Please fill in the value below with your GCP project ID and then run the cell.\n", - "\n", - "# Please fill in these values.\n", - "project_id = \"twisha-dev\" # @param {type:\"string\"}\n", - "\n", - "# Quick input validations.\n", - "assert project_id, \"⚠️ Please provide a Google Cloud project ID\"\n", - "\n", - "# Configure gcloud.\n", - "!gcloud config set project {project_id}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "O-oqMC5Ox-ZM" - }, - "source": [ - "### Enable APIs for AlloyDB and Vertex AI within your project" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "X-bzfFb4A-xK" - }, - "source": [ - "You will need to enable these APIs in order to create an AlloyDB database and utilize Vertex AI as an embeddings service!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "CKWrwyfzyTwH", - "outputId": "f5131e77-2750-4cb1-b153-c52a13aaf284" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Operation \"operations/acat.p2-1025562370644-8b00bf5d-6723-43a4-924d-2c6bb1d03508\" finished successfully.\n" - ] - } - ], - "source": [ - "# enable GCP services\n", - "!gcloud services enable alloydb.googleapis.com aiplatform.googleapis.com" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gn8g7-wCyZU6" - }, - "source": [ - "## Set up AlloyDB\n", - "You will need a Postgres AlloyDB instance for the following stages of this notebook. Please set the following variables." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "8q2lc-Po1mPv", - "outputId": "e268aea8-0514-4308-f5c7-1916031255b7" - }, - "outputs": [], - "source": [ - "# @markdown Please fill in the both the Google Cloud region and name of your AlloyDB instance. 
Once filled in, run the cell.\n", - "\n", - "# Please fill in these values.\n", - "region = \"us-central1\" # @param {type:\"string\"}\n", - "cluster_name = \"twisha-dev-cluster\" # @param {type:\"string\"}\n", - "instance_name = \"my-primary\" # @param {type:\"string\"}\n", - "database_name = \"testdb\" # @param {type:\"string\"}\n", - "table_name = \"investments\"\n", - "password = input(\"Please provide a password to be used for 'postgres' database user: \")" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "XXI1uUu3y8gc" - }, - "outputs": [], - "source": [ - "# Quick input validations.\n", - "assert region, \"⚠️ Please provide a Google Cloud region\"\n", - "assert instance_name, \"⚠️ Please provide the name of your instance\"\n", - "assert database_name, \"⚠️ Please provide the name of your database_name\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T616pEOUygYQ" - }, - "source": [ - "### Create an AlloyDB Instance\n", - "If you have already created an AlloyDB Cluster and Instance, you can skip these steps and skip to the Create a database section." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xyZYX4Jo1vfh" - }, - "source": [ - "> ⏳ - Creating an AlloyDB cluster may take a few minutes." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "MQYni0NlTLzC", - "outputId": "118d9a2b-2d9d-44ae-a33f-fb89ed6a2895" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Operation ID: operation-1729665156428-6251f0d3ab2eb-9c3ec3a5-10ef23ab\n" - ] - } - ], - "source": [ - "# create the AlloyDB Cluster\n", - "!gcloud beta alloydb clusters create {cluster_name} --password={password} --region={region}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o8LkscYH5Vfp" - }, - "source": [ - "Create an instance attached to our cluster with the following command.\n", - "> ⏳ - Creating an AlloyDB instance may take a few minutes." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "TkqQSWoY5Kab", - "outputId": "78e02d10-5e14-457a-86c6-21348898bd0a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Operation ID: operation-1729665166396-6251f0dd2cd62-dc533934-2eadd06b\n" - ] - } - ], - "source": [ - "!gcloud beta alloydb instances create {instance_name} --instance-type=PRIMARY --cpu-count=2 --region={region} --cluster={cluster_name}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BXsQ1UJv4ZVJ" - }, - "source": [ - "To connect to your AlloyDB instance from this notebook, you will need to enable public IP on your instance. Alternatively, you can follow [these instructions](https://cloud.google.com/alloydb/docs/connect-external) to connect to an AlloyDB for PostgreSQL instance with Private IP from outside your VPC." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "OPVWsQB04Yyl", - "outputId": "79f213ac-a069-4b15-e949-189f166dfca1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Operation ID: operation-1729665591178-6251f272475e4-aba1efce-e0d09dfa\n" - ] - } - ], - "source": [ - "!gcloud beta alloydb instances update {instance_name} --region={region} --cluster={cluster_name} --assign-inbound-public-ip=ASSIGN_IPV4 --database-flags=\"password.enforce_complexity=on\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UabC_qh5HOVy" - }, - "source": [ - "Please wait for the instance to be updated. This might take some time. You can see if the changes are reflecting using:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "_KC91mQZHABv", - "outputId": "6da8a6e4-549b-428d-a488-8dc993ddd216" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "availabilityType: REGIONAL\n", - "clientConnectionConfig:\n", - " sslConfig:\n", - " sslMode: ENCRYPTED_ONLY\n", - "createTime: '2024-10-23T06:32:46.512121934Z'\n", - "geminiConfig:\n", - " entitled: true\n", - "instanceType: PRIMARY\n", - "ipAddress: 10.124.48.14\n", - "machineConfig:\n", - " cpuCount: 2\n", - "name: projects/twisha-dev/locations/us-central1/clusters/embeddings-test-cluster/instances/my-primary\n", - "nodes:\n", - "- zoneId: us-central1-b\n", - "observabilityConfig:\n", - " enabled: false\n", - " maxQueryStringLength: 10240\n", - " preserveComments: false\n", - " queryPlansPerMinute: 200\n", - " recordApplicationTags: false\n", - " trackActiveQueries: false\n", - " trackClientAddress: false\n", - " trackWaitEventTypes: true\n", - " trackWaitEvents: true\n", - "publicIpAddress: 34.172.203.210\n", - "queryInsightsConfig:\n", - " queryPlansPerMinute: 5\n", - " 
queryStringLength: 1024\n", - " recordApplicationTags: false\n", - " recordClientAddress: false\n", - "reconciling: true\n", - "state: READY\n", - "uid: d6a0e50d-2f88-4dd1-b250-0f717c1dcf95\n", - "updateTime: '2024-10-23T06:40:26.317060059Z'\n", - "writableNode:\n", - " zoneId: us-central1-a\n" - ] - } - ], - "source": [ - "!gcloud beta alloydb instances describe {instance_name} --region={region} --cluster={cluster_name}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_K86id-dcjcm" - }, - "source": [ - "### Connect to AlloyDB\n", - "\n", - "This function will create a connection pool to your AlloyDB instance using the [AlloyDB Python connector](https://github.com/GoogleCloudPlatform/alloydb-python-connector). The AlloyDB Python connector will automatically create secure connections to your AlloyDB instance using mTLS." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "id": "fYKVQzv2cjcm" - }, - "outputs": [], - "source": [ - "import asyncpg\n", - "\n", - "import sqlalchemy\n", - "from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine\n", - "\n", - "from google.cloud.alloydb.connector import AsyncConnector, IPTypes\n", - "\n", - "async def init_connection_pool(connector: AsyncConnector, db_name: str, pool_size: int = 5) -> AsyncEngine:\n", - " # initialize Connector object for connections to AlloyDB\n", - " connection_string = f\"projects/{project_id}/locations/{region}/clusters/{cluster_name}/instances/{instance_name}\"\n", - "\n", - " async def getconn() -> asyncpg.Connection:\n", - " conn: asyncpg.Connection = await connector.connect(\n", - " connection_string,\n", - " \"asyncpg\",\n", - " user=\"postgres\",\n", - " password=password,\n", - " db=db_name,\n", - " ip_type=IPTypes.PUBLIC,\n", - " )\n", - " return conn\n", - "\n", - " pool = create_async_engine(\n", - " \"postgresql+asyncpg://\",\n", - " async_creator=getconn,\n", - " pool_size=pool_size,\n", - " max_overflow=0,\n", - " )\n", - " return 
pool" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "i_yNN1MnJpTR" - }, - "source": [ - "### Create a Database\n", - "\n", - "Nex, you will create database to store the data using the connection pool. Enabling public IP takes a few minutes, you may get an error that there is no public IP address. Please wait and retry this step if you hit an error!" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "7PX05ndo_AMc", - "outputId": "0931754a-aeb8-4895-e0b5-eeb01ffe5506" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Database 'test_db' created successfully\n" - ] - } - ], - "source": [ - "from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine\n", - "from sqlalchemy import text, exc\n", - "\n", - "from google.cloud.alloydb.connector import AsyncConnector, IPTypes\n", - "\n", - "async def create_db(database_name):\n", - " # Get a raw connection directly from the connector\n", - " connector = AsyncConnector()\n", - " connection_string = f\"projects/{project_id}/locations/{region}/clusters/{cluster_name}/instances/{instance_name}\"\n", - " pool = await init_connection_pool(connector, \"postgres\")\n", - " async with pool.connect() as conn:\n", - " try:\n", - " await conn.execute(text(\"COMMIT\")) # end transaction\n", - " await conn.execute(text(f\"CREATE DATABASE {database_name}\"))\n", - " print(f\"Database '{database_name}' created successfully\")\n", - " except exc.ProgrammingError:\n", - " print(f\"Database '{database_name}' already exists\")\n", - "\n", - "await create_db(database_name=database_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HdolCWyatZmG" - }, - "source": [ - "### Download data\n", - "\n", - "The following code has been prepared code to help insert the CSV data into your AlloyDB for PostgreSQL database." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Dzr-2VZIkvtY" - }, - "source": [ - "Download the CSV file:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5KkIQ2zSvQkN", - "outputId": "f1980d73-4171-4fb1-b912-164187ba283b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Copying gs://cloud-samples-data/alloydb/investments_data to file://./investments.csv\n", - " Completed files 1/1 | 22.8MiB/22.8MiB \n", - "\n", - "Average throughput: 10.0MiB/s\n" - ] - } - ], - "source": [ - "!gcloud storage cp gs://cloud-samples-data/alloydb/investments_data ./investments.csv" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oFU13dCBlYHh" - }, - "source": [ - "The download can be verified by the following command or using the \"Files\" tab." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "nQBs10I8vShh", - "outputId": "e81e933b-819d-46ac-f4de-6a1f943faa48" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "generate_batch_embeddings.ipynb investments.csv\n", - "generate_batch_embeddings2.ipynb run.sh\n" - ] - } - ], - "source": [ - "!ls" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2H7rorG9Ivur" - }, - "source": [ - "In this next step you will:\n", - "\n", - "1. Create the table into store data\n", - "2. 
And insert the data from the CSV into the database table" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "r16wPmxOBn_r" - }, - "source": [ - "### Import data to your database\n" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": { - "id": "v1pi9-8tB_pH" - }, - "outputs": [], - "source": [ - "# Prepare data\n", - "import pandas as pd\n", - "\n", - "data = \"./investments.csv\"\n", - "\n", - "df = pd.read_csv(data)\n", - "df['etf'] = df['etf'].map({'t': True, 'f': False})\n", - "df['rating'] = df['rating'].astype(str).fillna('')" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 345 - }, - "id": "4R6tzuUtLypO", - "outputId": "270d5fcd-b62d-4e3c-8c4e-25428798a350" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " id ticker etf market rating \\\n", - "0 5807 QLC False US HOLD \n", - "1 7762 TTEK False US HOLD \n", - "2 7763 NWE False US HOLD \n", - "3 5811 SASR False US HOLD \n", - "4 5813 RUSHA False US HOLD \n", - "\n", - " overview overview_embedding \\\n", - "0 **Ticker:** QLC\\n**Company:** Global X NASDAQ... NaN \n", - "1 **Ticker Symbol:** TTEK\\n\\n**Company Name:** ... NaN \n", - "2 **Newell Brands Inc. (NWE)** is a leading glo... NaN \n", - "3 **SASR: Sandbridge Acquisition Corp.**\\n\\nSan... NaN \n", - "4 **RUSHA: iShares MSCI Russia ETF**\\n\\nThe iSh... NaN \n", - "\n", - " analysis analysis_embedding \n", - "0 **Investment Rating: Hold**\\n\\n**Investment R... NaN \n", - "1 **Investment Rating: Hold**\\n\\nTwilio (TTEK) ... NaN \n", - "2 **Investment Rating**: Hold\\n\\n**Investment R... NaN \n", - "3 **Investment Rating: Hold**\\n\\n**Investment R... NaN \n", - "4 **Investment Rating: Hold**\\n\\nRUSHA is a sui... NaN " - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UstTWGJyL7j-" - }, - "source": [ - "The data consists of the following columns:\n", - "\n", - "* **id**\n", - "* **ticker**: A string representing the stock symbol or ticker (e.g., \"AAPL\" for Apple, \"GOOG\" for Google).\n", - "* **etf**: A boolean value indicating whether the asset is an ETF (True) or not (False).\n", - "* **market**: A string representing the stock exchange where the asset is traded.\n", - "* **rating**: Whether to hold, buy or sell a stock.\n", - "* **overview**: A text field for a general overview or description of the asset.\n", - "* **analysis**: A text field, for a more detailed analysis of the asset.\n", - "* **overview_embedding** (empty)\n", - "* **analysis_embedding** (empty)" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": { - "id": "KqpLkwbWCJaw" - }, - "outputs": [], - 
"source": [ - "create_table_cmd = sqlalchemy.text(\n", - " f'CREATE TABLE {table_name} ( \\\n", - " id SERIAL PRIMARY KEY, \\\n", - " ticker VARCHAR(255) NOT NULL UNIQUE, \\\n", - " etf BOOLEAN, \\\n", - " market VARCHAR(255), \\\n", - " rating TEXT, \\\n", - " overview TEXT, \\\n", - " overview_embedding VECTOR (768), \\\n", - " analysis TEXT, \\\n", - " analysis_embedding VECTOR (768) \\\n", - " )'\n", - ")\n", - "\n", - "\n", - "insert_data_cmd = sqlalchemy.text(\n", - " f\"\"\"\n", - " INSERT INTO {table_name} (id, ticker, etf, market,\n", - " rating, overview, analysis) VALUES (:id, :ticker, :etf, :market,\n", - " :rating, :overview, :analysis)\n", - " \"\"\"\n", - ")\n", - "\n", - "parameter_map = [\n", - " {\n", - " \"id\": row[\"id\"],\n", - " \"ticker\": row[\"ticker\"],\n", - " \"etf\": row[\"etf\"],\n", - " \"market\": row[\"market\"],\n", - " \"rating\": row[\"rating\"],\n", - " \"overview\": row[\"overview\"],\n", - " \"analysis\": row[\"analysis\"],\n", - " }\n", - " for index, row in df.iterrows()\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": { - "id": "qCsM2KXbdYiv" - }, - "outputs": [], - "source": [ - "from google.cloud.alloydb.connector import AsyncConnector\n", - "\n", - "connector = AsyncConnector()\n", - "\n", - "# Create table and insert data\n", - "async def insert_data(pool):\n", - " async with pool.connect() as db_conn:\n", - " await db_conn.execute(sqlalchemy.text(\"CREATE EXTENSION IF NOT EXISTS vector;\"))\n", - " await db_conn.execute(create_table_cmd)\n", - " await db_conn.execute(\n", - " insert_data_cmd,\n", - " parameter_map,\n", - " )\n", - " await db_conn.commit()\n", - "\n", - "pool = await init_connection_pool(connector, database_name)\n", - "await insert_data(pool)\n", - "await pool.dispose()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IaC8uhlfEwam" - }, - "source": [ - "## Create the embeddings workflow\n", - "\n", - "The embeddings workflow contains four major 
parts:\n", - "1. Read the data\n", - "2. Batch the data\n", - "3. Generate embeddings\n", - "4. Update original table\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oIk5GxbnFaE3" - }, - "source": [ - "#### Step 0: Configure Logging" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "id": "wvYGGRRoFXl4" - }, - "outputs": [], - "source": [ - "import logging\n", - "import sys\n", - "\n", - "# Configure the root logger to output messages with INFO level or above\n", - "logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='%(asctime)s[%(levelname)5s][%(name)14s] - %(message)s', datefmt='%H:%M:%S', force=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ekrEM22pJ2df" - }, - "source": [ - "#### Step 1: Read the data\n", - "\n", - "This code reads data from a database and yields it for further processing." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "id": "IZgMik9XBW19" - }, - "outputs": [], - "source": [ - "from typing import AsyncIterator, List\n", - "from sqlalchemy import RowMapping\n", - "from sqlalchemy.ext.asyncio import AsyncEngine\n", - "\n", - "async def get_source_data(pool: AsyncEngine, embed_cols: List[str]) -> AsyncIterator[RowMapping]:\n", - " \"\"\"\n", - " Yields data in the form of:\n", - " {'id' : 'id1', 'col1': 'val1', 'col2': 'val2'}\n", - " where col1 and col2 are columns containing data to be embedded.\n", - " \"\"\"\n", - " logger = logging.getLogger('get_source_data')\n", - "\n", - " sql = f\"SELECT id, {', '.join(embed_cols)} FROM {table_name}\"\n", - " logger.info(f\"Running SQL query: {sql}\")\n", - " async with pool.connect() as conn:\n", - " async for row in await conn.stream(text(sql)):\n", - " logger.debug(f\"yielded row: {row._mapping['id']}\")\n", - " # Yield the row as a dictionary (RowMapping)\n", - " yield row._mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Kg54pvhjJ5kL" - }, - "source": [ - 
"#### Step 2: Batch the data\n", - "\n", - "This code defines a function called `batch_source_data` that takes database rows and groups them into batches based on a character count limit (max_char_count). This batching process is crucial for efficient embedding generation for these reasons:\n", - "\n", - "* **Resource Optimization:** Instead of sending numerous small requests, batching allows us to send fewer, larger requests. This significantly optimizes resource usage and potentially reduces API costs.\n", - "\n", - "* **Working Within API Limits:** The max_char_count limit ensures each batch stays within the API's acceptable input size, preventing issues with exceeding the maximum character limit.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "id": "76qq6G38CZfm" - }, - "outputs": [], - "source": [ - "from typing import Any, List\n", - "import asyncio\n", - "\n", - "async def batch_source_data(read_generator: AsyncIterator[RowMapping], embed_cols: List[str]) -> AsyncIterator[List[dict[str, Any]]]:\n", - " \"\"\"\n", - " Yields data in the form of:\n", - " [\n", - " {'id' : 'id1', 'col1': 'val1', 'col2': 'val2'},\n", - " ...\n", - " ]\n", - " where col1 and col2 are columns containing data to be embedded.\n", - " \"\"\"\n", - " logger = logging.getLogger('batch_data')\n", - "\n", - " global max_char_count\n", - "\n", - " batch = []\n", - " char_count = 0\n", - " batch_num = 0\n", - "\n", - " async for row in read_generator:\n", - " # Char count in current row\n", - " row_char_count = sum(len(row[col]) for col in embed_cols)\n", - "\n", - " if char_count + row_char_count > max_char_count:\n", - " batch_num += 1\n", - " logger.info(f\"yielded batch number: {batch_num} with length: {len(batch)}\")\n", - " yield batch\n", - " batch, char_count = [], 0\n", - "\n", - " # Add the current row to the batch\n", - " batch.append(row)\n", - " char_count += row_char_count\n", - "\n", - " if batch:\n", - " batch_num += 1\n", - " 
logger.info(f\"Yielded batch number: {batch_num} with length: {len(batch)}\")\n", - " yield batch" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_L4EnrleJ8gy" - }, - "source": [ - "#### Step 3: Generate embeddings\n", - "\n", - "This step converts your text data into numerical representations called \"embeddings.\" These embeddings capture the meaning and relationships between words, making them useful for various tasks like search, recommendations, and clustering.\n", - "\n", - "The code uses two functions to efficiently generate embeddings:\n", - "\n", - "**embed_text**\n", - "\n", - "This function your text data and sends it to vertex AI, transforming the text in specific columns into embeddings.\n", - "\n", - "**embed_objects_concurrently**\n", - "\n", - "This function is the orchestrator. It manages the embedding generation process for multiple batches of text concurrently. This function ensures that all batches are processed efficiently without overwhelming the system." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "id": "4OYdrJk9Co0v" - }, - "outputs": [], - "source": [ - "from google.api_core.exceptions import ResourceExhausted\n", - "from typing import Union\n", - "from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel\n", - "\n", - "async def embed_text(\n", - " logger: logging.Logger,\n", - " batch_data: List[dict[str, Any]],\n", - " model: TextEmbeddingModel,\n", - " cols_to_embed: List[str],\n", - " task_type: str = \"SEMANTIC_SIMILARITY\",\n", - " retries: int = 100,\n", - " delay: int = 1,\n", - ") -> List[dict[str, Union[List[float], str]]]:\n", - " \"\"\"\n", - " Returns data in the form of:\n", - " [\n", - " {\n", - " 'id': 'id1',\n", - " 'col1_embedding': [1.0, 1.1, ...],\n", - " 'col2_embedding': [2.0, 2.1, ...],\n", - " ...\n", - " },\n", - " ...\n", - " ]\n", - " where col1 and col2 are columns containing data to be embedded.\n", - " \"\"\"\n", - " global total_char_count\n", - "\n", - " # Place all of the embeddings into a single list\n", - " inputs = []\n", - " for data in batch_data:\n", - " inputs.extend(\n", - " TextEmbeddingInput(data[col], task_type) for col in cols_to_embed\n", - " )\n", - "\n", - " for attempt in range(retries): # Retry loop\n", - " try:\n", - " # Get embeddings for the text data\n", - " embeddings = await model.get_embeddings_async(inputs)\n", - "\n", - " # Increase total char count\n", - " total_char_count += sum([len(input.text) for input in inputs])\n", - "\n", - " # group the results together by id\n", - " embedding_iter = iter(embeddings)\n", - " results = []\n", - " for row in batch_data:\n", - " r = { 'id': row['id'] }\n", - " for col in cols_to_embed:\n", - " r[f'{col}_embedding'] = str(next(embedding_iter).values)\n", - " results.append(r)\n", - " return results\n", - "\n", - " except ResourceExhausted as e:\n", - " if attempt < retries - 1: # Retry only if attempts are left\n", - " logger.warning(f\"Error: {e}. 
Retrying in {delay} seconds...\")\n", - " await asyncio.sleep(delay) # Wait before retrying\n", - " else:\n", - " logger.error(f\"Failed to get embeddings after {retries} attempts.\")\n", - " raise # Raise the error if all retries fail\n", - "\n", - " return []\n", - "\n", - "async def embed_objects_concurrently(\n", - " cols_to_embed: List[str],\n", - " batch_data: AsyncIterator[List[dict[str, Any]]],\n", - " model: TextEmbeddingModel,\n", - " task_type: str,\n", - " max_concurrency: int = 5,\n", - ") -> AsyncIterator[List[dict[str, Union[str, List[float]]]]]:\n", - " \"\"\"\n", - " Embeds text from objects concurrently with a maximum concurrency limit. This\n", - " function processes batches of data concurrently, limiting the number of\n", - " simultaneous embedding tasks to improve efficiency and resource utilization.\n", - " \"\"\"\n", - " logger = logging.getLogger('embed_objects')\n", - " # Keep track of pending tasks\n", - " pending: set[asyncio.Task] = set()\n", - " has_next = True\n", - " while pending or has_next:\n", - " while len(pending) < max_concurrency and has_next:\n", - " try:\n", - " data = await batch_data.__anext__() \n", - " coro = embed_text(logger, data, model, cols_to_embed, task_type)\n", - " pending.add(asyncio.ensure_future(coro))\n", - " except StopAsyncIteration:\n", - " has_next = False\n", - "\n", - " done, pending = await asyncio.wait(\n", - " pending, return_when=asyncio.FIRST_COMPLETED\n", - " )\n", - "\n", - " for task in done:\n", - " result = task.result()\n", - " logger.info(f\"Embedding task completed: Processed {len(result)} rows.\")\n", - " yield result\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FjErJPrJKA2j" - }, - "source": [ - "#### Step 4: Update original table\n", - "\n", - "After generating embeddings for your text data, you need to store them in your database. 
This step efficiently updates your original table with the newly created embeddings.\n", - "\n", - "This process uses two functions to manage database updates:\n", - "\n", - "**batch_update_rows**\n", - "1. This function takes a batch of data (including the embeddings) and updates the corresponding rows in your database table.\n", - "2. It constructs an SQL UPDATE query to modify specific columns with the embedding values.\n", - "3. It ensures that the updates are done efficiently and correctly within a database transaction.\n", - "\n", - "\n", - "**batch_update_rows_concurrently**\n", - "\n", - "1. This function handles the concurrent updating of multiple batches of data.\n", - "2. It creates multiple \"tasks\" that each execute the batch_update_rows function on a separate batch.\n", - "3. It limits the number of concurrent tasks to avoid overloading your database and system resources.\n", - "4. It manages the execution of these tasks, ensuring that all batches are processed efficiently." 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "id": "lEyvhlOCCr7F" - }, - "outputs": [], - "source": [ - "from sqlalchemy import text\n", - "\n", - "async def batch_update_rows(pool: AsyncEngine, logger: logging.Logger, data: List[dict[str, Any]], cols_to_embed: List[str]) -> None:\n", - " update_query = f\"\"\"\n", - " UPDATE {table_name}\n", - " SET {', '.join([f'{col}_embedding = :{col}_embedding' for col in cols_to_embed])}\n", - " WHERE id = :id;\n", - " \"\"\"\n", - "\n", - " async with pool.connect() as conn:\n", - " await conn.execute(\n", - " text(update_query),\n", - " # Create parameters for all rows in the data\n", - " parameters = data,\n", - " )\n", - " await conn.commit()\n", - " logger.info(f\"Updated {len(data)} rows in database.\")\n", - "\n", - "\n", - "async def batch_update_rows_concurrently(\n", - " pool: AsyncEngine,\n", - " embed_data: AsyncIterator[List[dict[str, Any]]],\n", - " cols_to_embed: List[str],\n", - " max_concurrency: int = 5\n", - "):\n", - " logger = logging.getLogger('update_rows')\n", - " # Keep track of pending tasks\n", - " pending: set[asyncio.Task] = set()\n", - " has_next = True\n", - " while pending or has_next:\n", - " while len(pending) < max_concurrency and has_next:\n", - " try:\n", - " data = await embed_data.__anext__() \n", - " coro = batch_update_rows(pool, logger, data, cols_to_embed)\n", - " pending.add(asyncio.ensure_future(coro))\n", - " except StopAsyncIteration:\n", - " has_next = False\n", - "\n", - " done, pending = await asyncio.wait(\n", - " pending, return_when=asyncio.FIRST_COMPLETED\n", - " )\n", - "\n", - " logger.info(\"All database update tasks completed.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HSv4DwzbJc5J" - }, - "source": [ - "## Run the embeddings workflow\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "id": "rWb1T9aIBWa-" - }, - "outputs": [], - "source": [ - "# Max token count for the 
embeddings API\n", - "max_tokens = 20000\n", - "\n", - "# For some tokenizers and text, there's a rough approximation that 1 token corresponds to about 3-4 characters. This is a very general guideline and can vary significantly.\n", - "max_char_count = max_tokens * 3\n", - "\n", - "cols_to_embed = ['analysis','overview']\n", - "\n", - "# Model to use for generating embeddings\n", - "model_name = 'text-embedding-004'\n", - "\n", - "# Generate optimised embeddings for a given task\n", - "# Ref: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types#supported_task_types\n", - "task = \"SEMANTIC_SIMILARITY\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D3FUBaXIUquR" - }, - "source": [ - "This runs the complete embeddings workflow:\n", - "\n", - "1. Getting source data\n", - "2. Batching source data\n", - "3. Generating embeddings for batches\n", - "4. Updating data batches in the original table" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "syO1Zq3o5PnI", - "outputId": "8db5edfc-7b9e-46da-bda8-123444033b37" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "16:20:21[ INFO][get_source_data] - Running SQL query: SELECT id, analysis, overview FROM investments\n", - "16:20:30[ INFO][ batch_data] - yielded batch number: 1 with length: 21\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 2 with length: 22\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 3 with length: 23\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 4 with length: 22\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 5 with length: 21\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 6 with length: 20\n", - "16:20:31[ INFO][ batch_data] - yielded batch number: 7 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 8 with length: 21\n", - "16:20:32[ INFO][ 
batch_data] - yielded batch number: 9 with length: 22\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 10 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 11 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 12 with length: 22\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 13 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 14 with length: 22\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 15 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 16 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 17 with length: 20\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 18 with length: 21\n", - "16:20:32[ INFO][ batch_data] - yielded batch number: 19 with length: 21\n", - "16:20:33[ INFO][ batch_data] - yielded batch number: 20 with length: 23\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 21 with length: 21\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 22 with length: 21\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 23 with length: 21\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 24 with length: 20\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 25 with length: 22\n", - "16:20:34[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:20:34[ INFO][ batch_data] - yielded batch number: 26 with length: 20\n", - "16:20:35[ INFO][ embed_objects] - Embedding task completed: Processed 21 
rows.\n", - "16:20:35[ INFO][ batch_data] - yielded batch number: 27 with length: 21\n", - "16:20:35[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:35[ INFO][ batch_data] - yielded batch number: 28 with length: 21\n", - "16:20:35[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:35[ INFO][ batch_data] - yielded batch number: 29 with length: 20\n", - "16:20:35[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:20:41[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:41[ INFO][ batch_data] - yielded batch number: 30 with length: 19\n", - "16:20:41[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:41[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:41[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:42[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:42[ INFO][ 
update_rows] - Updated 21 rows in database.\n", - "16:20:42[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:43[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:44[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:44[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:44[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:44[ INFO][ batch_data] - yielded batch number: 31 with length: 21\n", - "16:20:44[ INFO][ batch_data] - yielded batch number: 32 with length: 21\n", - "16:20:44[ INFO][ batch_data] - yielded batch number: 33 with length: 22\n", - "16:20:44[ INFO][ batch_data] - yielded batch number: 34 with length: 21\n", - "16:20:44[ INFO][ batch_data] - yielded batch number: 35 with length: 21\n", - "16:20:44[ 
INFO][ batch_data] - yielded batch number: 36 with length: 21\n", - "16:20:44[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:44[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:44[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:44[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 37 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 38 with length: 20\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 39 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 40 with length: 20\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 41 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 42 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 43 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 44 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 45 with length: 23\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 46 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 47 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 48 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 49 with length: 21\n", - "16:20:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: 
Processed 22 rows.\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 50 with length: 20\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 51 with length: 23\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 52 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 53 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 54 with length: 21\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 55 with length: 22\n", - "16:20:45[ INFO][ batch_data] - yielded batch number: 56 with length: 22\n", - "16:20:45[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:46[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:46[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:46[ INFO][ batch_data] - yielded batch number: 57 with length: 21\n", - "16:20:46[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:46[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:46[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:46[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:46[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:46[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:47[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:47[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:20:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:47[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:47[ INFO][ 
update_rows] - Updated 21 rows in database.\n", - "16:20:47[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:47[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 58 with length: 21\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 59 with length: 22\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 60 with length: 22\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 61 with length: 21\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 62 with length: 21\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 63 with length: 23\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 64 with length: 19\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 65 with length: 19\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 66 with length: 20\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 67 with length: 21\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 68 with length: 21\n", - "16:20:48[ INFO][ batch_data] - yielded batch number: 69 with length: 20\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 23 
rows.\n", - "16:20:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:48[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:49[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:49[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:49[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:49[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:49[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:20:49[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 70 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 71 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 72 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 73 with length: 20\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 74 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 75 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 76 with length: 21\n", - "16:20:50[ INFO][ batch_data] - yielded batch number: 77 with length: 22\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task 
completed: Processed 21 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:20:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:50[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:51[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:51[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:51[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:51[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:51[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:51[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:20:51[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:20:51[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:52[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:20:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:52[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 78 with length: 21\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 79 with length: 20\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 80 with length: 20\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 81 with length: 21\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 82 with length: 20\n", - "16:20:52[ INFO][ batch_data] - yielded batch number: 83 with length: 20\n", - "16:20:52[ INFO][ batch_data] - 
yielded batch number: 84 with length: 22\n", - "16:20:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:52[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:52[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:20:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:53[ INFO][ batch_data] - yielded batch number: 85 with length: 20\n", - "16:20:53[ INFO][ batch_data] - yielded batch number: 86 with length: 21\n", - "16:20:53[ INFO][ batch_data] - yielded batch number: 87 with length: 20\n", - "16:20:53[ INFO][ batch_data] - yielded batch number: 88 with length: 21\n", - "16:20:53[ INFO][ batch_data] - yielded batch number: 89 with length: 21\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:53[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:53[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:53[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:20:53[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:54[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:54[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:54[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:54[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:54[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:54[ INFO][ 
embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:54[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:54[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 90 with length: 20\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 91 with length: 20\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 92 with length: 20\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 93 with length: 22\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 94 with length: 20\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 95 with length: 21\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 96 with length: 21\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 97 with length: 21\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 98 with length: 20\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 99 with length: 21\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 100 with length: 22\n", - "16:20:55[ INFO][ batch_data] - yielded batch number: 101 with length: 21\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 21 rows in 
database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:55[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:55[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 102 with length: 21\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 103 with length: 21\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 104 with length: 20\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 105 with length: 21\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 106 with length: 22\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 107 with length: 20\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 108 with length: 21\n", - "16:20:56[ INFO][ batch_data] - yielded batch number: 109 with length: 23\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:56[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:56[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:56[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:57[ INFO][ 
update_rows] - Updated 20 rows in database.\n", - "16:20:57[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:57[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:57[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:57[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:57[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:57[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:57[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:58[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:58[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:58[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:58[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:58[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 110 with length: 20\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 111 with length: 22\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 112 with length: 21\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 113 with length: 22\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 114 with length: 20\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 115 with length: 22\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 116 with length: 21\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 117 with length: 20\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 118 with length: 20\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 119 with length: 21\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 120 with length: 21\n", - "16:20:58[ INFO][ batch_data] - yielded batch number: 121 with length: 21\n", - "16:20:58[ INFO][ embed_objects] - Embedding task completed: 
Processed 21 rows.\n", - "16:20:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:59[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:59[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:20:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:20:59[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:20:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:20:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 122 with length: 21\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 123 with length: 20\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 124 with length: 20\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 125 with length: 20\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 126 with length: 21\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 127 with length: 20\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 128 with length: 22\n", - "16:21:00[ INFO][ batch_data] - yielded batch number: 129 with length: 21\n", - "16:21:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:00[ INFO][ embed_objects] - 
Embedding task completed: Processed 22 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:00[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:00[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:00[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:01[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:01[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:02[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:02[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:02[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:02[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:02[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:02[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:02[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:02[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:02[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:02[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:02[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:02[ INFO][ batch_data] - yielded batch number: 130 with length: 21\n", - "16:21:02[ INFO][ batch_data] - yielded batch number: 131 with length: 21\n", - "16:21:02[ INFO][ batch_data] - yielded batch number: 132 with length: 21\n", - "16:21:02[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:02[ INFO][ update_rows] - 
Updated 21 rows in database.\n", - "16:21:03[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:03[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 133 with length: 22\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 134 with length: 20\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 135 with length: 22\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 136 with length: 21\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 137 with length: 22\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 138 with length: 21\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 139 with length: 20\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 140 with length: 21\n", - "16:21:03[ INFO][ batch_data] - yielded batch number: 141 with length: 20\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:03[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:03[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:03[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:03[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:03[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:04[ INFO][ update_rows] - Updated 22 rows in 
database.\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 142 with length: 21\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 143 with length: 23\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 144 with length: 22\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 145 with length: 21\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 146 with length: 21\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 147 with length: 21\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 148 with length: 22\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 149 with length: 21\n", - "16:21:04[ INFO][ batch_data] - yielded batch number: 150 with length: 21\n", - "16:21:04[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:04[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:04[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:04[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:04[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:05[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:05[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:05[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:05[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:05[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:05[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:05[ INFO][ 
embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:06[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:06[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:06[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:06[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:06[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 151 with length: 22\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 152 with length: 23\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 153 with length: 20\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 154 with length: 21\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 155 with length: 21\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 156 with length: 21\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 157 with length: 22\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 158 with length: 21\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 159 with length: 20\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 160 with length: 21\n", - "16:21:06[ INFO][ batch_data] - yielded batch number: 161 with length: 21\n", - "16:21:06[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:06[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:06[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:06[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:06[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:07[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:07[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:07[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:07[ INFO][ embed_objects] - Embedding task 
completed: Processed 22 rows.\n", - "16:21:07[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:07[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:07[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:07[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 162 with length: 21\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 163 with length: 22\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 164 with length: 22\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 165 with length: 22\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 166 with length: 22\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 167 with length: 21\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 168 with length: 21\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 169 with length: 22\n", - "16:21:08[ INFO][ batch_data] - yielded batch number: 170 with length: 21\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:08[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:08[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:09[ 
INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:09[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:09[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:09[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:09[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:09[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:09[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:09[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:09[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 171 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 172 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 173 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 174 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 175 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 176 with length: 21\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 177 with length: 20\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 178 with length: 20\n", - "16:21:09[ INFO][ batch_data] - yielded batch number: 179 with length: 21\n", - "16:21:09[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:10[ INFO][ batch_data] - yielded batch number: 180 with length: 
21\n", - "16:21:10[ INFO][ batch_data] - yielded batch number: 181 with length: 21\n", - "16:21:10[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:10[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:10[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:10[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:10[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:10[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:10[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:10[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:11[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:11[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:11[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:11[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:11[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:11[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:11[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:11[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:11[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:11[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:12[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:12[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:12[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 182 with length: 22\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 183 with length: 21\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 184 with length: 21\n", - "16:21:12[ INFO][ batch_data] - yielded 
batch number: 185 with length: 23\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 186 with length: 21\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 187 with length: 20\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 188 with length: 21\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 189 with length: 21\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 190 with length: 22\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 191 with length: 20\n", - "16:21:12[ INFO][ batch_data] - yielded batch number: 192 with length: 22\n", - "16:21:12[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:12[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:12[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:12[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:12[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:12[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:12[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:13[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:13[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:13[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:13[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:13[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:13[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:14[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:14[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:14[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:14[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:14[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:14[ 
INFO][ batch_data] - yielded batch number: 193 with length: 22\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 194 with length: 20\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 195 with length: 22\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 196 with length: 21\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 197 with length: 22\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 198 with length: 23\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 199 with length: 20\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 200 with length: 22\n", - "16:21:14[ INFO][ batch_data] - yielded batch number: 201 with length: 22\n", - "16:21:14[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:14[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:14[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:14[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:14[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:15[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:15[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:15[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:15[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:15[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:15[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:15[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:15[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:15[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:15[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task 
completed: Processed 21 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 202 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 203 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 204 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 205 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 206 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 207 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 208 with length: 21\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 209 with length: 21\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 210 with length: 21\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 211 with length: 22\n", - "16:21:16[ INFO][ batch_data] - yielded batch number: 212 with length: 22\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:16[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:16[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:17[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:17[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - 
"16:21:18[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:18[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:18[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:18[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:18[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:18[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:18[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:18[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:18[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 213 with length: 20\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 214 with length: 21\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 215 with length: 22\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 216 with length: 22\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 217 with length: 22\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 218 with length: 21\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 219 with length: 21\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 220 with length: 21\n", - "16:21:18[ INFO][ batch_data] - yielded batch number: 221 with length: 21\n", - "16:21:18[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:19[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:19[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:19[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:19[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:19[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:19[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 22 
rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:20[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:20[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:21[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:21[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 222 with length: 21\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 223 with length: 22\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 224 with length: 22\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 225 with length: 23\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 226 with length: 21\n", - "16:21:21[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:21[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:21[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:21[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:21[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:21[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 227 with length: 22\n", - 
"16:21:21[ INFO][ batch_data] - yielded batch number: 228 with length: 22\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 229 with length: 21\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 230 with length: 21\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 231 with length: 22\n", - "16:21:21[ INFO][ batch_data] - yielded batch number: 232 with length: 22\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:21[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:22[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:22[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 233 with length: 21\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 234 with length: 23\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 235 with length: 20\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 236 with length: 20\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 237 with length: 22\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 238 with 
length: 23\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 239 with length: 23\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 240 with length: 22\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 241 with length: 21\n", - "16:21:23[ INFO][ batch_data] - yielded batch number: 242 with length: 21\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:23[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:23[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:24[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:24[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:24[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:24[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:24[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:24[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:24[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 243 with length: 21\n", - "16:21:24[ INFO][ batch_data] - yielded 
batch number: 244 with length: 21\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 245 with length: 21\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 246 with length: 21\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 247 with length: 20\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 248 with length: 20\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 249 with length: 22\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 250 with length: 23\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 251 with length: 21\n", - "16:21:24[ INFO][ batch_data] - yielded batch number: 252 with length: 21\n", - "16:21:24[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:24[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:24[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:25[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:25[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:25[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:25[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:26[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:26[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:26[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:26[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:26[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - 
"16:21:26[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:26[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:27[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 253 with length: 21\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 254 with length: 20\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 255 with length: 22\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 256 with length: 21\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 257 with length: 20\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 258 with length: 20\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 259 with length: 22\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 260 with length: 22\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 261 with length: 21\n", - "16:21:27[ INFO][ batch_data] - yielded batch number: 262 with length: 22\n", - "16:21:27[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:27[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:27[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:27[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:27[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:27[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:27[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:27[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:27[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:28[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:28[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:28[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:28[ INFO][ embed_objects] - Embedding task 
completed: Processed 21 rows.\n", - "16:21:28[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:28[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:28[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:28[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:29[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:29[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:29[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 263 with length: 21\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 264 with length: 21\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 265 with length: 22\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 266 with length: 21\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 267 with length: 20\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 268 with length: 22\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 269 with length: 22\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 270 with length: 21\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 271 with length: 21\n", - "16:21:29[ INFO][ batch_data] - yielded batch number: 272 with length: 21\n", - "16:21:29[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:29[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:29[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:30[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:30[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:30[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:30[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:30[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:30[ INFO][ 
embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:30[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:30[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:30[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:30[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:31[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:31[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:31[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:31[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:31[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:31[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:32[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:32[ INFO][ batch_data] - yielded batch number: 273 with length: 20\n", - "16:21:32[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:32[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:33[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 274 with length: 22\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 275 with length: 22\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 276 with length: 22\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 277 with length: 21\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 278 with length: 20\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 279 with length: 22\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 280 with length: 21\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 281 with length: 22\n", - "16:21:33[ INFO][ batch_data] - yielded batch number: 282 with length: 22\n", - "16:21:33[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:33[ INFO][ 
embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:33[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:33[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:33[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:33[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:34[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:34[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:34[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:34[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:34[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:34[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:34[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 283 with length: 22\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 284 with length: 22\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 285 with length: 20\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 286 with length: 21\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 287 with length: 20\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 288 with length: 20\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 289 with length: 20\n", - "16:21:34[ INFO][ batch_data] - 
yielded batch number: 290 with length: 21\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 291 with length: 21\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 292 with length: 22\n", - "16:21:34[ INFO][ batch_data] - yielded batch number: 293 with length: 21\n", - "16:21:34[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:35[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:35[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:35[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:35[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:35[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:35[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:35[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:35[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 294 with length: 22\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 295 with length: 23\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 296 with length: 21\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 297 with length: 22\n", - 
"16:21:36[ INFO][ batch_data] - yielded batch number: 298 with length: 22\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 299 with length: 21\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 300 with length: 21\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 301 with length: 21\n", - "16:21:36[ INFO][ batch_data] - yielded batch number: 302 with length: 22\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:36[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:36[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:37[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:37[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:37[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:37[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:37[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:37[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:37[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:37[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:38[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:38[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:38[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:38[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:38[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:38[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:38[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:38[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:39[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:39[ INFO][ embed_objects] - Embedding task 
completed: Processed 22 rows.\n", - "16:21:39[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 303 with length: 20\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 304 with length: 22\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 305 with length: 22\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 306 with length: 22\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 307 with length: 21\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 308 with length: 21\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 309 with length: 23\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 310 with length: 22\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 311 with length: 23\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 312 with length: 21\n", - "16:21:39[ INFO][ batch_data] - yielded batch number: 313 with length: 22\n", - "16:21:39[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:39[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:39[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:39[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:39[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:40[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:40[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:40[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:40[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:40[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:40[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:40[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:40[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - 
"16:21:40[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:40[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:41[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:41[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:41[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 314 with length: 20\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 315 with length: 20\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 316 with length: 21\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 317 with length: 22\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 318 with length: 21\n", - "16:21:41[ INFO][ batch_data] - yielded batch number: 319 with length: 21\n", - "16:21:41[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:41[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:42[ INFO][ batch_data] - yielded batch number: 320 with length: 21\n", - "16:21:42[ INFO][ batch_data] - yielded batch number: 321 with length: 20\n", - "16:21:42[ INFO][ batch_data] - yielded batch number: 322 with length: 22\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:42[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:42[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:42[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:42[ INFO][ update_rows] - Updated 22 
rows in database.\n", - "16:21:42[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:43[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 323 with length: 23\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 324 with length: 22\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 325 with length: 21\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 326 with length: 21\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 327 with length: 21\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 328 with length: 22\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 329 with length: 22\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 330 with length: 22\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 331 with length: 22\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 332 with length: 23\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 333 with length: 21\n", - "16:21:43[ INFO][ batch_data] - yielded batch number: 334 with length: 20\n", - "16:21:43[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - 
"16:21:44[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:44[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:44[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:44[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:45[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 335 with length: 21\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 336 with length: 21\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 337 with length: 21\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 338 with length: 23\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 339 with length: 22\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 340 with length: 21\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 341 with length: 20\n", - "16:21:45[ INFO][ batch_data] - yielded batch number: 342 with length: 20\n", - "16:21:45[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:46[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:46[ INFO][ embed_objects] - Embedding task 
completed: Processed 22 rows.\n", - "16:21:46[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:46[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:47[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:47[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:47[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:47[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:47[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:47[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:47[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:48[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 343 with length: 22\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 344 with length: 22\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 345 with length: 19\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 346 with length: 21\n", - "16:21:48[ INFO][ batch_data] - 
yielded batch number: 347 with length: 20\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 348 with length: 21\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 349 with length: 23\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 350 with length: 21\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 351 with length: 21\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 352 with length: 21\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 353 with length: 19\n", - "16:21:48[ INFO][ batch_data] - yielded batch number: 354 with length: 20\n", - "16:21:48[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:49[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:49[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:49[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:49[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:49[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:49[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:49[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:49[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:49[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:49[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:50[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:50[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:50[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 355 with length: 22\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 356 with length: 22\n", - 
"16:21:50[ INFO][ batch_data] - yielded batch number: 357 with length: 22\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 358 with length: 20\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 359 with length: 21\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 360 with length: 20\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 361 with length: 22\n", - "16:21:50[ INFO][ batch_data] - yielded batch number: 362 with length: 20\n", - "16:21:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:50[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:50[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:51[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:51[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:51[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:51[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:51[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:51[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:51[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:21:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:52[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:52[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:52[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:52[ INFO][ update_rows] - Updated 21 
rows in database.\n", - "16:21:52[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:21:52[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:52[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:53[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 363 with length: 23\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 364 with length: 20\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 365 with length: 21\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 366 with length: 22\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 367 with length: 21\n", - "16:21:53[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:53[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:53[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 368 with length: 20\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 369 with length: 24\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 370 with length: 21\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 371 with length: 21\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 372 with length: 22\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 373 with length: 24\n", - "16:21:53[ INFO][ batch_data] - yielded batch number: 374 with length: 21\n", - "16:21:53[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:54[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - 
"16:21:54[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:54[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:54[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:55[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:55[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:21:55[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:55[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 375 with length: 20\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 376 with length: 21\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 377 with length: 21\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 378 with length: 20\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 379 with length: 21\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 380 with length: 21\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 381 with length: 21\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 382 with length: 20\n", - "16:21:55[ INFO][ batch_data] - yielded batch number: 383 with length: 21\n", - "16:21:55[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:55[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:55[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:55[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:21:55[ INFO][ embed_objects] - Embedding task completed: Processed 24 rows.\n", - "16:21:56[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:56[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:56[ INFO][ update_rows] - Updated 23 
rows in database.\n", - "16:21:56[ INFO][ embed_objects] - Embedding task completed: Processed 24 rows.\n", - "16:21:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:56[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:56[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:56[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:57[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 24 rows in database.\n", - "16:21:57[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:57[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:57[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 384 with length: 20\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 385 with length: 21\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 386 with length: 22\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 387 with length: 20\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 388 with length: 21\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 389 with length: 21\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 390 with length: 20\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 391 with length: 21\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 392 with length: 19\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 393 with length: 23\n", - "16:21:57[ INFO][ batch_data] - yielded batch number: 394 with length: 20\n", - "16:21:57[ INFO][ 
embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:57[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:57[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:58[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:21:58[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:58[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:58[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:58[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:58[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:58[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:58[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:59[ INFO][ update_rows] - Updated 24 rows in database.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 395 with length: 22\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 396 with length: 20\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 397 with length: 21\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 398 with length: 20\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 399 with length: 23\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 400 with length: 22\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 401 with length: 23\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 402 with length: 
22\n", - "16:21:59[ INFO][ batch_data] - yielded batch number: 403 with length: 22\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 19 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:21:59[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:21:59[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:22:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:22:00[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:22:00[ INFO][ update_rows] - Updated 19 rows in database.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:22:00[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:22:00[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:00[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:22:01[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:01[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:01[ INFO][ batch_data] - Yielded batch number: 404 with length: 1\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: 
Processed 20 rows.\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: Processed 21 rows.\n", - "16:22:01[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:22:01[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:22:01[ INFO][ update_rows] - Updated 21 rows in database.\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: Processed 23 rows.\n", - "16:22:01[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:22:01[ INFO][ embed_objects] - Embedding task completed: Processed 20 rows.\n", - "16:22:02[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:02[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:22:02[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:02[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:22:02[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:22:02[ INFO][ embed_objects] - Embedding task completed: Processed 22 rows.\n", - "16:22:02[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:22:02[ INFO][ embed_objects] - Embedding task completed: Processed 1 rows.\n", - "16:22:02[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:22:03[ INFO][ update_rows] - Updated 23 rows in database.\n", - "16:22:03[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:03[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:03[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:22:03[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:22:04[ INFO][ update_rows] - Updated 1 rows in database.\n", - "16:22:04[ INFO][ update_rows] - Updated 22 rows in database.\n", - "16:22:08[ INFO][ update_rows] - Updated 20 rows in database.\n", - "16:22:08[ INFO][ update_rows] - Updated 21 
rows in database.\n", - "16:22:08[ INFO][ update_rows] - All database update tasks completed.\n", - "Job started at: Thu Jan 1 05:30:30 1970\n", - "Job ended at: Thu Jan 1 05:32:16 1970\n", - "Total run time: 106.78 seconds\n", - "Total characters embedded: 23619225\n" - ] - } - ], - "source": [ - "import vertexai\n", - "import time\n", - "import asyncio\n", - "from vertexai.language_models import TextEmbeddingModel\n", - "\n", - "pool_size = 10\n", - "embed_data_concurrency = 20\n", - "batch_update_concurrency = 10\n", - "total_char_count = 0\n", - "\n", - "# Set up connections to the database\n", - "connector = AsyncConnector()\n", - "pool = await init_connection_pool(connector, database_name, pool_size=pool_size)\n", - "\n", - "# Initialise VertexAI and the model to be used to generate embeddings\n", - "vertexai.init(project=project_id, location=region)\n", - "model = TextEmbeddingModel.from_pretrained(model_name)\n", - "\n", - "start_time = time.monotonic()\n", - "\n", - "# Fetch source data from the database\n", - "source_data = get_source_data(pool, cols_to_embed)\n", - "\n", - "# Divide the source data into batches for efficient processing\n", - "batch_data = batch_source_data(source_data, cols_to_embed)\n", - "\n", - "# Generate embeddings for the batched data concurrently\n", - "embeddings_data = embed_objects_concurrently(cols_to_embed, batch_data, model, task, max_concurrency=embed_data_concurrency)\n", - "\n", - "# Update the database with the generated embeddings concurrently\n", - "await batch_update_rows_concurrently(pool, embeddings_data, cols_to_embed, max_concurrency=batch_update_concurrency)\n", - "\n", - "end_time = time.monotonic()\n", - "elapsed_time = end_time - start_time\n", - "\n", - "# Release database connections and close the connector\n", - "await pool.dispose()\n", - "await connector.close()\n", - "\n", - "print(f\"Job started at: {time.ctime(start_time)}\")\n", - "print(f\"Job ended at: {time.ctime(end_time)}\")\n", - "print(f\"Total 
run time: {elapsed_time:.2f} seconds\")\n", - "print(f\"Total characters embedded: {total_char_count}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": { - "id": "fzZJsWRZAMxs" - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "python-docs-samples", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -}