Batching in upload_documents() does not work #40157

Open
cecheta opened this issue Mar 20, 2025 · 8 comments
Labels: Client, needs-team-attention, Search

Comments

@cecheta
Member

cecheta commented Mar 20, 2025

  • Package Name: azure-search-documents
  • Package Version: 11.6.0b10
  • Operating System: Windows (WSL2)
  • Python Version: 3.12.9

Describe the bug
I noticed that in the source code of SearchClient.upload_documents(), specifically in the _index_documents_actions() function, the batch is meant to be split and retried when it is too large and the service returns a 413 (RequestEntityTooLargeError). However, the batch splitting doesn't work. Instead, the following error is observed:

Traceback (most recent call last):
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 703, in _index_documents_actions
    batch_response = self._client.documents.index(batch=batch, error_map=error_map, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 105, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_generated/operations/_documents_operations.py", line 1232, in index
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/core/exceptions.py", line 163, in map_error
    raise error
azure.search.documents._search_documents_error.RequestEntityTooLargeError: Operation returned an invalid status 'Request Entity Too Large'
Content: The page was not displayed because the request entity is too large.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/Code/playground/main.py", line 37, in <module>
    client.get_search_client("my-index").upload_documents(documents=documents)
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 596, in upload_documents
    results = self.index_documents(batch, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 105, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 695, in index_documents
    return self._index_documents_actions(actions=batch.actions, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 709, in _index_documents_actions
    batch_response_first_half = self._index_documents_actions(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 703, in _index_documents_actions
    batch_response = self._client.documents.index(batch=batch, error_map=error_map, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: azure.search.documents._generated.operations._documents_operations.DocumentsOperations.index() got multiple values for keyword argument 'error_map'

To Reproduce
Steps to reproduce the behavior:

  1. Create AI Search resource
  2. Set the following env vars: AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_API_KEY
  3. Run the following code:
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
)

client = SearchIndexClient(
    os.getenv("AZURE_SEARCH_ENDPOINT"),
    AzureKeyCredential(os.getenv("AZURE_SEARCH_API_KEY")),
)

index_name = "my-index"

client.create_or_update_index(
    SearchIndex(
        name=index_name,
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            SimpleField(name="content", type=SearchFieldDataType.String),
        ],
    )
)

# 10,000 documents of ~100 KB each (about 1 GB in total), far too large for one request
documents = [{"id": str(i), "content": " " * 100000} for i in range(10000)]

client.get_search_client(index_name).upload_documents(documents=documents)

Expected behavior
After the first batch fails, it should split into two smaller batches and retry both.

Additional context
From what I've seen, removing error_map=error_map from the two recursive self._index_documents_actions() calls seems to fix it.
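
The TypeError in the second traceback can be reproduced without the SDK. The following is a minimal sketch, assuming two stand-in functions, index() and index_actions() (hypothetical names, not the SDK's code), that mirror the pattern in the traceback: the inner call passes error_map explicitly while **kwargs may also carry an error_map key.

```python
def index(batch, error_map=None, **kwargs):
    """Stand-in for the generated index() operation (hypothetical)."""
    return f"indexed {len(batch)} actions"


def index_actions(actions, **kwargs):
    """Stand-in for _index_documents_actions() (hypothetical)."""
    error_map = {413: RuntimeError}
    # error_map is passed explicitly here; if kwargs also holds an
    # "error_map" key, Python raises TypeError before index() runs.
    return index(batch=actions, error_map=error_map, **kwargs)


# A top-level call works because kwargs carries no error_map.
first = index_actions([1, 2, 3])

# A recursive-style call that forwards error_map through kwargs fails.
try:
    index_actions([1, 2], error_map={413: RuntimeError})
    message = ""
except TypeError as exc:
    message = str(exc)
```

Because the duplicate is detected when the call is made, the failure happens on the first retry, before any smaller batch is sent.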

@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Search labels Mar 20, 2025

Hello @cecheta. I'm an AI assistant for the azure-sdk-for-python repository. I have some suggestions that you can try out while the team gets back to you.

  • The error is caused by passing the error_map keyword argument twice during recursive batch splitting, which results in a TypeError.
  • A potential workaround is to remove error_map from the recursive calls as you noted in the source at line 710 and line 717.
  • Consider opening a PR or filing an issue with these observations so that the batch-splitting logic can be updated to correctly handle the error_map parameter during retries.
  • Verify if this behavior occurs consistently with large documents triggering a 413: RequestEntityTooLargeError to confirm the fix.

The team will get back to you shortly; hopefully this helps in the meantime.

@cecheta
Member Author

cecheta commented Mar 20, 2025

Somewhat related: I noticed that uploading more than 32000 documents in one call also fails, with a "too many documents" error. Because that is not a 413 RequestEntityTooLargeError, it isn't caught and no split is attempted.

Perhaps the retry could be applied to that scenario as well?
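
Until the retry covers that case, one possible workaround is to split the documents by count on the caller's side. This is a minimal sketch, assuming the 32000 figure from the comment above is the per-request limit; chunked() and MAX_ACTIONS_PER_BATCH are hypothetical names, not part of the SDK.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Assumed limit, taken from the comment above; check the service limits.
MAX_ACTIONS_PER_BATCH = 32000


def chunked(items: Sequence[T], size: int = MAX_ACTIONS_PER_BATCH) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


documents = [{"id": str(i)} for i in range(70000)]
batch_sizes = [len(batch) for batch in chunked(documents)]
```

Each yielded batch could then be passed to upload_documents(documents=batch) in turn. This only bounds the document count; a batch can still exceed the request size limit and produce a 413.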

@xiangyan99
Member

Could you try the latest version and see if it works?

https://pypi.org/project/azure-search-documents/11.6.0b11/

@xiangyan99 xiangyan99 added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Mar 25, 2025
@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 25, 2025

Hi @cecheta. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@cecheta
Member Author

cecheta commented Mar 26, 2025

Hi @xiangyan99, I just tried again but got this error:

Traceback (most recent call last):
  File "/home/user/Code/playground/app.py", line 30, in <module>
    client.get_search_client(index_name).upload_documents(documents=documents)
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 596, in upload_documents
    results = self.index_documents(batch, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 105, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 695, in index_documents
    return self._index_documents_actions(actions=batch.actions, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_search_client.py", line 703, in _index_documents_actions
    batch_response = self._client.documents.index(batch=batch, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 105, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Code/playground/.venv/lib/python3.12/site-packages/azure/search/documents/_generated/operations/_documents_operations.py", line 1234, in index
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Request Entity Too Large'
Content: The page was not displayed because the request entity is too large.

It looks like error_map=error_map was removed from the self._client.documents.index() call, when it should have been removed from the two recursive self._index_documents_actions() calls.
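
To illustrate why the 413 is no longer caught, here is a minimal sketch with stand-in classes (hypothetical, not the SDK's code). It assumes that, as in the two tracebacks, the specific exception is raised only when the status code is found in error_map, and that a generic HttpResponseError is raised otherwise.

```python
class HttpResponseError(Exception):
    """Stand-in for azure.core.exceptions.HttpResponseError."""


class RequestEntityTooLargeError(HttpResponseError):
    """Stand-in for the SDK's 413 error (subclassing is an assumption)."""


def index(batch, error_map=None):
    """Stand-in for index(); always answers 413 for this illustration."""
    status_code = 413
    if error_map and status_code in error_map:
        raise error_map[status_code]("mapped 413")
    raise HttpResponseError("generic 413")


def send(batch, use_error_map):
    error_map = {413: RequestEntityTooLargeError} if use_error_map else None
    try:
        index(batch, error_map=error_map)
    except RequestEntityTooLargeError:
        return "caught: would split and retry"
    except HttpResponseError:
        # The SDK code has no such handler, so the error propagates to the
        # caller; it is caught here only to report the outcome.
        return "generic error: no split"


with_map = send([1, 2], use_error_map=True)
without_map = send([1, 2], use_error_map=False)
```

Without error_map on the index() call, the split-and-retry branch is never reached.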

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Mar 26, 2025
@xiangyan99
Member

Could you tell me why it should be removed from self._index_documents_actions()?

@xiangyan99 xiangyan99 added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Apr 1, 2025
@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Apr 1, 2025

github-actions bot commented Apr 1, 2025

Hi @cecheta. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@cecheta
Member Author

cecheta commented Apr 1, 2025

Could you tell me why it should be removed from self._index_documents_actions()?

The variable error_map = {413: RequestEntityTooLargeError} needs to be passed to the self._client.documents.index() function, so that the RequestEntityTooLargeError exception can be raised when there is a 413 status code and the client can retry.

error_map is a local variable to the _index_documents_actions() function, and therefore does not need to be included as a keyword argument when the function is called recursively. Instead, it should be passed to self._client.documents.index().

I believe the code should look like this:

    def _index_documents_actions(self, actions: List[IndexAction], **kwargs: Any) -> List[IndexingResult]:
        error_map = {413: RequestEntityTooLargeError}

        kwargs["headers"] = self._merge_client_headers(kwargs.get("headers"))
        batch = IndexBatch(actions=actions)
        try:
            batch_response = self._client.documents.index(batch=batch, error_map=error_map, **kwargs)
            return cast(List[IndexingResult], batch_response.results)
        except RequestEntityTooLargeError:
            if len(actions) == 1:
                raise
            pos = round(len(actions) / 2)
            batch_response_first_half = self._index_documents_actions(
                actions=actions[:pos], **kwargs
            )
            if batch_response_first_half:
                result_first_half = batch_response_first_half
            else:
                result_first_half = []
            batch_response_second_half = self._index_documents_actions(
                actions=actions[pos:], **kwargs
            )
            if batch_response_second_half:
                result_second_half = batch_response_second_half
            else:
                result_second_half = []
            result_first_half.extend(result_second_half)
            return result_first_half
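
The split-and-retry behaviour proposed above can be checked in isolation. This is a minimal sketch, assuming a hypothetical FakeClient that rejects batches above a fixed number of actions; the real service limits the request size in bytes, so the count-based limit is only a simplification, and TooLarge stands in for RequestEntityTooLargeError.

```python
from typing import List


class TooLarge(Exception):
    """Stand-in for RequestEntityTooLargeError."""


class FakeClient:
    """Hypothetical client that rejects batches above `limit` actions."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.calls: List[int] = []  # size of every batch attempted

    def index(self, batch: List[str]) -> List[str]:
        self.calls.append(len(batch))
        if len(batch) > self.limit:
            raise TooLarge()
        return [f"ok:{item}" for item in batch]


def index_actions(client: FakeClient, actions: List[str]) -> List[str]:
    try:
        return client.index(actions)
    except TooLarge:
        if len(actions) == 1:
            raise  # a single action cannot be split further
        pos = round(len(actions) / 2)
        # Recurse on each half; results are joined in the original order.
        first = index_actions(client, actions[:pos])
        second = index_actions(client, actions[pos:])
        return first + second


client = FakeClient(limit=3)
results = index_actions(client, [str(i) for i in range(10)])
```

Here client.calls records the attempted batch sizes, which makes it easy to confirm that both halves are retried after each failure.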

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Apr 1, 2025