Batching in `upload_documents()` does not work #40157

Comments
Hello @cecheta. I'm an AI assistant for the azure-sdk-for-python repository. I have some suggestions that you can try out while the team gets back to you. The team will get back to you shortly; hopefully this helps in the meantime.
Somewhat related: I noticed that if you try to upload more than 32000 documents, that also produces a too-many-documents error, but because it's not a `RequestEntityTooLargeError`, the split-and-retry never kicks in. Perhaps the retry could be applied to that scenario as well?
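A client-side stopgap for that case is to chunk uploads so no single request approaches the service's 32,000-documents-per-batch limit. A minimal sketch (the helper name and chunk size are illustrative, not part of the SDK):

```python
from typing import Any, Dict, List


def upload_in_chunks(client, documents: List[Dict[str, Any]], chunk_size: int = 1000) -> List[Any]:
    """Upload documents in fixed-size chunks to stay well under the
    32,000-documents-per-request service limit."""
    results: List[Any] = []
    for start in range(0, len(documents), chunk_size):
        chunk = documents[start:start + chunk_size]
        results.extend(client.upload_documents(documents=chunk))
    return results
```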
Could you try the latest version and see if it works?
Hi @cecheta. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario, please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hi @xiangyan99, I just tried again but get this error: ...

It looks like `error_map=error_map` should be removed from `_index_documents_actions()`.
Could you tell me why it should be removed from `self._index_documents_actions()`?
The variable `error_map` gets forwarded explicitly in the recursive calls, so on a retry it is already inside `**kwargs`; the nested `self._client.documents.index(batch=batch, error_map=error_map, **kwargs)` call then receives `error_map` twice and raises a `TypeError`.
I believe the code should look like this:

```python
def _index_documents_actions(self, actions: List[IndexAction], **kwargs: Any) -> List[IndexingResult]:
    error_map = {413: RequestEntityTooLargeError}
    kwargs["headers"] = self._merge_client_headers(kwargs.get("headers"))
    batch = IndexBatch(actions=actions)
    try:
        batch_response = self._client.documents.index(batch=batch, error_map=error_map, **kwargs)
        return cast(List[IndexingResult], batch_response.results)
    except RequestEntityTooLargeError:
        if len(actions) == 1:
            raise
        pos = round(len(actions) / 2)
        batch_response_first_half = self._index_documents_actions(
            actions=actions[:pos], **kwargs
        )
        if batch_response_first_half:
            result_first_half = batch_response_first_half
        else:
            result_first_half = []
        batch_response_second_half = self._index_documents_actions(
            actions=actions[pos:], **kwargs
        )
        if batch_response_second_half:
            result_second_half = batch_response_second_half
        else:
            result_second_half = []
        result_first_half.extend(result_second_half)
        return result_first_half
```
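The underlying failure is ordinary Python keyword-argument semantics. A standalone sketch with hypothetical stand-in functions (not the SDK's actual code) of why passing `error_map` both explicitly and through `**kwargs` fails:

```python
def index(batch=None, error_map=None, **kwargs):
    """Stand-in for self._client.documents.index(...)."""
    return []


def index_documents_actions(actions, **kwargs):
    """Stand-in for the SDK method: error_map is built locally and also
    forwarded explicitly alongside **kwargs."""
    error_map = {413: "RequestEntityTooLargeError"}
    return index(batch=actions, error_map=error_map, **kwargs)


index_documents_actions([1, 2, 3])  # fine: kwargs is empty
index_documents_actions([1, 2, 3], error_map={413: "x"})
# TypeError: index() got multiple values for keyword argument 'error_map'
# -- the same collision the retried halves hit once error_map rides along
# inside **kwargs.
```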
Describe the bug
I noticed that in the source code of `SearchClient.upload_documents()`, more specifically the `_index_documents_actions()` function, there is meant to be a batch split and retry if the batch is too large and a `413: RequestEntityTooLargeError` is produced. However, the batch splitting doesn't work. Instead, the following error is observed: ...

To Reproduce
Steps to reproduce the behavior: with `AZURE_SEARCH_ENDPOINT` and `AZURE_SEARCH_API_KEY` set, upload a batch large enough to trigger a 413 (see the sketch below).
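A minimal repro sketch, assuming an existing index (the index name, document shape, and sizes here are illustrative, not from the original report):

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="my-index",  # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"]),
)

# Documents large enough that the combined payload exceeds the request size
# limit, so the service answers 413 and the split-and-retry path is exercised.
docs = [{"id": str(i), "content": "x" * 100_000} for i in range(2_000)]

# Expected: the SDK halves the batch and retries. Observed: an error instead.
client.upload_documents(documents=docs)
```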
Expected behavior
After the first batch fails, it should split into two smaller batches and retry both.
Additional context
From what I've seen, removing `error_map=error_map` here and here (i.e. from the two recursive calls in `_index_documents_actions()`) seems to fix it.
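Concretely, based on the proposed method above, the change sits in the two recursive calls inside the `except RequestEntityTooLargeError` handler (an excerpt sketch, not a complete patch; the "before" lines are inferred from the report):

```python
# Before (inferred): error_map is forwarded explicitly, so on the retried
# halves it also rides along in **kwargs and is then supplied a second time
# to self._client.documents.index(batch=batch, error_map=error_map, **kwargs).
#
#     batch_response_first_half = self._index_documents_actions(
#         actions=actions[:pos], error_map=error_map, **kwargs
#     )
#     batch_response_second_half = self._index_documents_actions(
#         actions=actions[pos:], error_map=error_map, **kwargs
#     )

# After: each recursive call rebuilds error_map locally, so nothing collides.
batch_response_first_half = self._index_documents_actions(
    actions=actions[:pos], **kwargs
)
batch_response_second_half = self._index_documents_actions(
    actions=actions[pos:], **kwargs
)
```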