#205: adding retry mechanism for Atum agent on HTTP dispatcher level by lsulak · Pull Request #443 · AbsaOSS/atum-service

lsulak · 2026-05-22T12:17:38Z

Closes: #205

Release Notes:

Atum Service: adding DB connection timeout parameter for being able to tune this on the server side
Atum Agent: adding a simple retry mechanism onto the agent's HTTP layer and also changeable settings for timeouts on the HTTP layer (for the underlying OkHTTP client)

…his on the server side

github-actions · 2026-05-22T12:20:20Z

Report: Report: server - scala:2.13.13

Metric (instruction)	Coverage	Threshold	Status
Overall	78.06%	71.0%	✅
Changed Files	0.0%	80.0%	✅

File Path	Coverage	Threshold	Status
PostgresConfig.scala	0.0%	0.0%	✅
TransactorProvider.scala	0.0%	0.0%	✅

…uts settings can be changed via the config now

…ovements

…necks-improvements' into feature/205-db-connection-bottlenecks-improvements

… checking idempontency of each operation etc

benedeki · 2026-05-22T16:15:28Z

+#atum.dispatcher.http.timeout.read-ms=30000        # response read timeout in milliseconds (default: 30 000)
+#atum.dispatcher.http.timeout.write-ms=10000       # request write timeout in milliseconds (default: 10 000)


Sorry I am not sure, what exactly mean read and write timeout?

wow, thanks for the review & doing it so quickly even! This is used here:

val okHttpClient = new OkHttpClient.Builder() .connectTimeout(connectTimeoutMs, TimeUnit.MILLISECONDS) .readTimeout(readTimeoutMs, TimeUnit.MILLISECONDS) .writeTimeout(writeTimeoutMs, TimeUnit.MILLISECONDS) .build()

so these values are used in the OkHTTP meaning that it's all on the HTTP layer: https://www.baeldung.com/okhttp-timeouts

From the source above:

ConnectionTimeout: It defines a time period in which our client should establish a connection with a target host.

ReadTimeout: It defines a maximum time of inactivity between two data packets when waiting for the server’s response.

WriteTimeout: it defines a maximum time of inactivity between two data packets when sending the request to the server.

benedeki · 2026-05-22T16:17:17Z

 #atum.dispatcher.capture.capture-limit=1000 # 0 means no limit

+# Retry configuration for the HTTP dispatcher (all keys optional — defaults apply if absent)
+#atum.dispatcher.http.retry.max-retries=3          # number of retries after first failure (0 = no retries)


My personal option would be, to have default on 1 - generally in a healthy, stable well load balanced environment, there should not be need for more - but not a hard complain.

I would tend to go more defensive on this one.

In healthy stable environment it might be that there are almost no retries or 1 is enough, but there were some production issues and I would like the agent to be generally / by default a bit more robust.

So, basically if there are 3 potential retries and it's not used in healthy environment, there is zero cost to have configured 3. But under the load during peaks / different environments this can be beneficial.

salamonpavel

PR #443 Review — Retry Mechanism & DB Connection Timeout

Summary

This PR introduces two welcome improvements:

Agent-side HTTP retry with exponential backoff — transient 5xx responses and IOExceptions (connection resets, read timeouts) are now retried up to 3 times with jitter, instead of failing immediately.
Configurable OkHTTP timeouts & HikariCP connectionTimeout — operators can now tune connect/read/write timeouts on the agent and connection-acquire timeout on the server without code changes.

Both changes are well-implemented: config keys are optional with sensible defaults, the 409-idempotency handling for saveCheckpoint is a nice touch, and the test coverage is solid.

Why client-side retries alone are not enough

While retries improve resilience against transient failures (brief network blips, single-request timeouts), they can amplify sustained overload:

Scenario	Without retries	With retries (×3)
100 agents hit a struggling server	100 requests	Up to 400 requests
Server returns 503 under load	Agents back off (by failing)	Agents retry 3× — more load on an already-overloaded server
DB connection pool exhausted	Requests fail at HikariCP timeout	Retried requests queue up again — pool stays saturated longer

The fundamental issue: the server currently has zero admission control. It accepts unlimited concurrent requests, all of which compete for 10 HikariCP connections. Client-side retries help survive momentary blips but worsen sustained pressure.

Lowering connectionTimeout from 30s → 10s is helpful (fail-fast), but it only changes when the failure happens — it doesn't prevent the overload itself.

Recommended server-side complement

The most effective protection is a concurrency semaphore at the transactor level — a single gate that limits how many queries can be in-flight across the entire application, matching the DB connection pool size.

Why the transactor level?

All database calls in the server flow through a single HikariTransactor[Task] provided by TransactorProvider. Wrapping it with a ZIO Semaphore means:

One change, all services protected — no need to modify every controller or service class.
Cumulative limit — all endpoints share the same permit pool.
Clean backpressure — excess fibers suspend on the semaphore instead of queuing inside HikariCP, giving you the option to fail fast with a timeout.

How it works

// In TransactorProvider.layer — after creating the HikariTransactor:
import zio.Semaphore
import cats.~>

for {
  // ... existing setup ...
  xa  <- HikariTransactor.fromHikariConfig[Task](hikariConfig, ...).toScopedZIO

  // Wrap: at most maxPoolSize queries run concurrently
  sem <- Semaphore.make(postgresConfig.maxPoolSize)
  throttledXa = xa.mapK(new ~>[Task, Task] {
    def apply[A](fa: Task[A]): Task[A] = sem.withPermit(fa)
  })
} yield throttledXa

Optional: fail-fast under sustained load

Instead of queueing indefinitely, reject with a clear error when the server is saturated:

sem.withPermit(fa)
   .timeoutFail(GeneralDatabaseError("server overloaded — try again later"))(5.seconds)

This turns into an HTTP 503 at the controller level, which the agent's new retry mechanism will handle gracefully — creating a healthy feedback loop:

Server overloaded → 503 → Agent backs off with jitter → Server recovers → Next retry succeeds

Relevant ZIO documentation

Semaphore — the core concurrency primitive used here.
ZIO.timeout / timeoutFail — for fail-fast behavior.
Ref — useful if you want to track/expose current concurrency as a metric.

How client + server work together

Agent (this PR)                          Server (proposed complement)
─────────────────                        ────────────────────────────
Request fails (5xx/IOException)          Too many in-flight queries
        │                                        │
        ▼                                        ▼
Retry with exponential backoff           Semaphore rejects with 503
  + jitter (spread retries)                (fail-fast, not queue)
        │                                        │
        ▼                                        ▼
Server has time to recover               Pool connections free up
        │                                        │
        └──────── Next retry succeeds ───────────┘

The two mechanisms are complementary:

The server protects itself (sheds load it can't handle).
The agent handles the resulting 503s gracefully (retries with backoff).
Together they create a stable feedback loop instead of a cascading failure.

Bottom line

✅ This PR is a solid step — merge-worthy on its own merits.

🔧 Follow-up recommendation: add a server-side Semaphore in TransactorProvider (single-line change) to cap concurrent DB access. Without it, the retry mechanism can amplify load during sustained outages. A proof-of-concept test demonstrating this is available at server/src/test/.../ThrottledCheckpointServiceDemoTests.scala in this branch.

lsulak · 2026-05-25T11:39:05Z

Bottom line

✅ This PR is a solid step — merge-worthy on its own merits.

🔧 Follow-up recommendation: add a server-side Semaphore in TransactorProvider (single-line change) to cap concurrent DB access. Without it, the retry mechanism can amplify load during sustained outages. A proof-of-concept test demonstrating this is available at server/src/test/.../ThrottledCheckpointServiceDemoTests.scala in this branch.

I think that the agentic reviewer is a bit biased because I put all of these ideas already into the ticket. I actually had semaphore already implemented, but later I found out that the DB is actually not under the pressure at all (I checked the cluster's commit latency, read/write throughtput etc), so I removed it and simplified this PR.

And certainly the server itself is also not under pressure, the only bottleneck seems to be the max DB connections that the Hikari pool uses, which I increased. It would be even better if I increase number of pods to 4 during peaks and back to 2 during normal operations, but this is not yet pattern our devops team works with on Kubernetes. Maybe later.

Maybe the agent's side is enough for now, the server should fail faster and the agent should spread the load a bit more, and we have twice as much DB connections now: https://github.com/absa-group/BDE-k8s-infra/pull/399

…ovements

#205: adding DB connection timeout parameter for being able to tune t…

5765a3c

…his on the server side

lsulak requested review from ABLL526, MatloaItumeleng, Zejnilovic, benedeki, dk1844 and salamonpavel as code owners May 22, 2026 12:17

lsulak self-assigned this May 22, 2026

lsulak added 3 commits May 22, 2026 16:42

#205: configuring the HTTP Dispatcher with retries and also the timeo…

ab9e21b

…uts settings can be changed via the config now

#205: adding unit tests for the retry feature

0bb5ce5

Merge branch 'master' into feature/205-db-connection-bottlenecks-impr…

958747f

…ovements

lsulak changed the title ~~#205: DB connection timeout during high-burst periods~~ #205: adding retry mechanism for Atum agent on HTTP dispatcher level May 22, 2026

lsulak added 3 commits May 22, 2026 16:51

fixing scala doc

c001be0

Merge remote-tracking branch 'origin/feature/205-db-connection-bottle…

e457d49

…necks-improvements' into feature/205-db-connection-bottlenecks-improvements

retry with checkpoint creation can be still successfull / testing and…

692b844

… checking idempontency of each operation etc

benedeki reviewed May 22, 2026

View reviewed changes

salamonpavel reviewed May 25, 2026

View reviewed changes

Merge branch 'master' into feature/205-db-connection-bottlenecks-impr…

dde2cc4

…ovements

lsulak merged commit 87caa84 into master Jun 1, 2026
6 of 8 checks passed

lsulak deleted the feature/205-db-connection-bottlenecks-improvements branch June 1, 2026 15:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#205: adding retry mechanism for Atum agent on HTTP dispatcher level#443

#205: adding retry mechanism for Atum agent on HTTP dispatcher level#443
lsulak merged 8 commits into
masterfrom
feature/205-db-connection-bottlenecks-improvements

lsulak commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

benedeki May 22, 2026

Uh oh!

lsulak May 25, 2026

Uh oh!

lsulak May 25, 2026

Uh oh!

benedeki May 22, 2026

Uh oh!

lsulak May 25, 2026 •

edited

Loading

Uh oh!

salamonpavel left a comment

Uh oh!

lsulak commented May 25, 2026 •

edited

Loading

Bottom line

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		#atum.dispatcher.http.timeout.read-ms=30000 # response read timeout in milliseconds (default: 30 000)
		#atum.dispatcher.http.timeout.write-ms=10000 # request write timeout in milliseconds (default: 10 000)

Conversation

lsulak commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Release Notes:

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benedeki May 22, 2026

Choose a reason for hiding this comment

Uh oh!

lsulak May 25, 2026

Choose a reason for hiding this comment

Uh oh!

lsulak May 25, 2026

Choose a reason for hiding this comment

Uh oh!

benedeki May 22, 2026

Choose a reason for hiding this comment

Uh oh!

lsulak May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

salamonpavel left a comment

Choose a reason for hiding this comment

PR #443 Review — Retry Mechanism & DB Connection Timeout

Summary

Why client-side retries alone are not enough

Recommended server-side complement

Why the transactor level?

How it works

Optional: fail-fast under sustained load

Relevant ZIO documentation

How client + server work together

Bottom line

Uh oh!

lsulak commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bottom line

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lsulak commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

lsulak May 25, 2026 •

edited

Loading

lsulak commented May 25, 2026 •

edited

Loading