[Pull-based Ingestion] Add support for dynamically updating ingestion error handling strategy with minor fixes #17565

Merged

Contributor

@varunbharadwaj varunbharadwaj commented Mar 11, 2025

Description

  1. This PR follows up on #17427 ([Pull-based Ingestion] Add error handling strategy to pull-based ingestion) and adds support for dynamically updating the ingestion error strategy through the update_settings API.
  2. With the BLOCK error strategy, the message processor retries a failed message indefinitely, waiting between attempts. Switching to the DROP strategy skips the failed message.
  3. The PR also includes two minor fixes: it corrects the initial global checkpoint in p2p segRep mode, which is validated by flows such as the CloseIndex API, and it handles a race condition that occurs when flush is called before the poller is completely initialized.
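
The retry behavior in item 2 can be illustrated with a small loop that re-reads the strategy after every failure. This is a minimal, hypothetical sketch and not the code from this PR: the class `ErrorStrategySketch`, its `updateStrategy` and `process` methods, and the `retryWaitMillis` parameter are assumptions made for the example. Only the BLOCK and DROP strategy names come from the description.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch; this is not OpenSearch's MessageProcessorRunnable.
class ErrorStrategySketch {
    enum ErrorStrategy { BLOCK, DROP }

    // AtomicReference so an update made by another thread (for example, a
    // settings-update handler) is visible to the processing loop.
    private final AtomicReference<ErrorStrategy> strategy;
    private final long retryWaitMillis; // assumed fixed wait between retries

    ErrorStrategySketch(ErrorStrategy initial, long retryWaitMillis) {
        this.strategy = new AtomicReference<>(initial);
        this.retryWaitMillis = retryWaitMillis;
    }

    void updateStrategy(ErrorStrategy updated) {
        strategy.set(updated);
    }

    /** Returns true if the message was processed, false if it was skipped. */
    boolean process(String message, Consumer<String> handler) {
        while (true) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                // Re-read the strategy on every failure so a dynamic update takes effect.
                if (strategy.get() == ErrorStrategy.DROP) {
                    return false; // DROP: skip the failed message
                }
                try {
                    Thread.sleep(retryWaitMillis); // BLOCK: wait, then retry
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false; // stop retrying if the thread is interrupted
                }
            }
        }
    }

    public static void main(String[] args) {
        ErrorStrategySketch processor = new ErrorStrategySketch(ErrorStrategy.BLOCK, 10);
        int[] attempts = {0};
        boolean processed = processor.process("doc-1", msg -> {
            // Simulate a permanently failing message; on the third failure the
            // strategy is switched to DROP, as an operator might do.
            if (++attempts[0] == 3) {
                processor.updateStrategy(ErrorStrategy.DROP);
            }
            throw new RuntimeException("simulated failure");
        });
        System.out.println("processed=" + processed + ", attempts=" + attempts[0]);
    }
}
```

In this sketch the message is retried under BLOCK until the strategy changes, and the third failure returns `false` because the strategy is DROP by then. Reading the strategy inside the loop, rather than once at startup, is what lets a settings update unblock a stuck message.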

This PR forms the base on which subsequent PRs will build to add pause/resume APIs.

Related Issues

Resolves part of #17442. Subsequent PRs will add pause/resume APIs.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for 92b576e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from 92b576e to 9f94093 on March 11, 2025 06:06
Contributor

✅ Gradle check result for 9f94093: SUCCESS


codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 70.73171% with 12 lines in your changes missing coverage. Please review.

Project coverage is 72.43%. Comparing base (e306d51) to head (4399432).
Report is 9 commits behind head on main.

Files with missing lines                               Patch %  Lines
...a/org/opensearch/index/engine/IngestionEngine.java  54.54%   4 Missing and 1 partial ⚠️
...ndices/pollingingest/MessageProcessorRunnable.java  72.22%   3 Missing and 2 partials ⚠️
...in/java/org/opensearch/index/shard/IndexShard.java  50.00%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17565      +/-   ##
============================================
+ Coverage     72.27%   72.43%   +0.15%     
- Complexity    65611    65696      +85     
============================================
  Files          5311     5311              
  Lines        304942   304975      +33     
  Branches      44225    44229       +4     
============================================
+ Hits         220407   220912     +505     
+ Misses        66448    65918     -530     
- Partials      18087    18145      +58     

@andrross
Member

Pause message processor when an error is encountered on BLOCK error strategy.

Would it be better to keep retrying indefinitely at some fixed rate? It can be hard to determine if errors are transient or not, and in this case if you enter the paused state because of a transient failure you'll just be stuck until something intervenes, right?

@varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch 2 times, most recently from 33fee6a to e6e34ed on March 12, 2025 19:01
@varunbharadwaj
Contributor Author

> Pause message processor when an error is encountered on BLOCK error strategy.
>
> Would it be better to keep retrying indefinitely at some fixed rate? It can be hard to determine if errors are transient or not, and in this case if you enter the paused state because of a transient failure you'll just be stuck until something intervenes, right?

Good point. As discussed today, I updated the processor to retry indefinitely and to skip a failed message only after the user switches to the DROP strategy.

Contributor

❌ Gradle check result for e6e34ed: FAILURE

@varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from e6e34ed to 5c0788c on March 12, 2025 20:08
Contributor

❌ Gradle check result for 5c0788c: null

@varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from 5c0788c to 589c235 on March 12, 2025 23:26
Contributor

✅ Gradle check result for 589c235: SUCCESS

@varunbharadwaj changed the title from "[Pull-based Ingestion] Add support for dynamically updating ingestion error handling strategy" to "[Pull-based Ingestion] Add support for dynamically updating ingestion error handling strategy with minor fixes" on Mar 14, 2025
Contributor

❌ Gradle check result for 12bbe28: FAILURE

@varunbharadwaj force-pushed the vb/ingestion_mgmt_api branch from 12bbe28 to 4399432 on March 14, 2025 17:12
Contributor

❕ Gradle check result for 4399432: UNSTABLE

  • TEST FAILURES:
      2 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.classMethod
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@yupeng9
Contributor

yupeng9 commented Mar 14, 2025

LGTM

@andrross andrross merged commit 1166998 into opensearch-project:main Mar 17, 2025
30 of 31 checks passed
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Mar 18, 2025
… error handling strategy with minor fixes (opensearch-project#17565)

* Fix global checkpoint for p2p segrep in ingestion mode

Signed-off-by: Varun Bharadwaj <[email protected]>

* Support updating ingestion error strategy

Signed-off-by: Varun Bharadwaj <[email protected]>

* Handle race condition on calling flush before poller is initialized

Signed-off-by: Varun Bharadwaj <[email protected]>

---------

Signed-off-by: Varun Bharadwaj <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>