#205: adding retry mechanism for Atum agent on HTTP dispatcher level#443
Conversation
…his on the server side
|
Report: Report: server - scala:2.13.13
|
…uts settings can be changed via the config now
…necks-improvements' into feature/205-db-connection-bottlenecks-improvements
… checking idempontency of each operation etc
| #atum.dispatcher.http.timeout.read-ms=30000 # response read timeout in milliseconds (default: 30 000) | ||
| #atum.dispatcher.http.timeout.write-ms=10000 # request write timeout in milliseconds (default: 10 000) |
There was a problem hiding this comment.
Sorry I am not sure, what exactly mean read and write timeout?
There was a problem hiding this comment.
wow, thanks for the review & doing it so quickly even! This is used here:
val okHttpClient = new OkHttpClient.Builder()
.connectTimeout(connectTimeoutMs, TimeUnit.MILLISECONDS)
.readTimeout(readTimeoutMs, TimeUnit.MILLISECONDS)
.writeTimeout(writeTimeoutMs, TimeUnit.MILLISECONDS)
.build()so these values are used in the OkHTTP meaning that it's all on the HTTP layer: https://www.baeldung.com/okhttp-timeouts
There was a problem hiding this comment.
From the source above:
ConnectionTimeout: It defines a time period in which our client should establish a connection with a target host.
ReadTimeout: It defines a maximum time of inactivity between two data packets when waiting for the server’s response.
WriteTimeout: it defines a maximum time of inactivity between two data packets when sending the request to the server.
| #atum.dispatcher.capture.capture-limit=1000 # 0 means no limit | ||
|
|
||
| # Retry configuration for the HTTP dispatcher (all keys optional — defaults apply if absent) | ||
| #atum.dispatcher.http.retry.max-retries=3 # number of retries after first failure (0 = no retries) |
There was a problem hiding this comment.
My personal option would be, to have default on 1 - generally in a healthy, stable well load balanced environment, there should not be need for more - but not a hard complain.
There was a problem hiding this comment.
I would tend to go more defensive on this one.
In healthy stable environment it might be that there are almost no retries or 1 is enough, but there were some production issues and I would like the agent to be generally / by default a bit more robust.
So, basically if there are 3 potential retries and it's not used in healthy environment, there is zero cost to have configured 3. But under the load during peaks / different environments this can be beneficial.
salamonpavel
left a comment
There was a problem hiding this comment.
PR #443 Review — Retry Mechanism & DB Connection Timeout
Summary
This PR introduces two welcome improvements:
- Agent-side HTTP retry with exponential backoff — transient 5xx responses and
IOExceptions (connection resets, read timeouts) are now retried up to 3 times with jitter, instead of failing immediately. - Configurable OkHTTP timeouts & HikariCP
connectionTimeout— operators can now tune connect/read/write timeouts on the agent and connection-acquire timeout on the server without code changes.
Both changes are well-implemented: config keys are optional with sensible defaults, the 409-idempotency handling for saveCheckpoint is a nice touch, and the test coverage is solid.
Why client-side retries alone are not enough
While retries improve resilience against transient failures (brief network blips, single-request timeouts), they can amplify sustained overload:
| Scenario | Without retries | With retries (×3) |
|---|---|---|
| 100 agents hit a struggling server | 100 requests | Up to 400 requests |
| Server returns 503 under load | Agents back off (by failing) | Agents retry 3× — more load on an already-overloaded server |
| DB connection pool exhausted | Requests fail at HikariCP timeout | Retried requests queue up again — pool stays saturated longer |
The fundamental issue: the server currently has zero admission control. It accepts unlimited concurrent requests, all of which compete for 10 HikariCP connections. Client-side retries help survive momentary blips but worsen sustained pressure.
Lowering connectionTimeout from 30s → 10s is helpful (fail-fast), but it only changes when the failure happens — it doesn't prevent the overload itself.
Recommended server-side complement
The most effective protection is a concurrency semaphore at the transactor level — a single gate that limits how many queries can be in-flight across the entire application, matching the DB connection pool size.
Why the transactor level?
All database calls in the server flow through a single HikariTransactor[Task] provided by TransactorProvider. Wrapping it with a ZIO Semaphore means:
- One change, all services protected — no need to modify every controller or service class.
- Cumulative limit — all endpoints share the same permit pool.
- Clean backpressure — excess fibers suspend on the semaphore instead of queuing inside HikariCP, giving you the option to fail fast with a timeout.
How it works
// In TransactorProvider.layer — after creating the HikariTransactor:
import zio.Semaphore
import cats.~>
for {
// ... existing setup ...
xa <- HikariTransactor.fromHikariConfig[Task](hikariConfig, ...).toScopedZIO
// Wrap: at most maxPoolSize queries run concurrently
sem <- Semaphore.make(postgresConfig.maxPoolSize)
throttledXa = xa.mapK(new ~>[Task, Task] {
def apply[A](fa: Task[A]): Task[A] = sem.withPermit(fa)
})
} yield throttledXaOptional: fail-fast under sustained load
Instead of queueing indefinitely, reject with a clear error when the server is saturated:
sem.withPermit(fa)
.timeoutFail(GeneralDatabaseError("server overloaded — try again later"))(5.seconds)This turns into an HTTP 503 at the controller level, which the agent's new retry mechanism will handle gracefully — creating a healthy feedback loop:
Server overloaded → 503 → Agent backs off with jitter → Server recovers → Next retry succeeds
Relevant ZIO documentation
Semaphore— the core concurrency primitive used here.ZIO.timeout/timeoutFail— for fail-fast behavior.Ref— useful if you want to track/expose current concurrency as a metric.
How client + server work together
Agent (this PR) Server (proposed complement)
───────────────── ────────────────────────────
Request fails (5xx/IOException) Too many in-flight queries
│ │
▼ ▼
Retry with exponential backoff Semaphore rejects with 503
+ jitter (spread retries) (fail-fast, not queue)
│ │
▼ ▼
Server has time to recover Pool connections free up
│ │
└──────── Next retry succeeds ───────────┘
The two mechanisms are complementary:
- The server protects itself (sheds load it can't handle).
- The agent handles the resulting 503s gracefully (retries with backoff).
- Together they create a stable feedback loop instead of a cascading failure.
Bottom line
✅ This PR is a solid step — merge-worthy on its own merits.
🔧 Follow-up recommendation: add a server-side Semaphore in TransactorProvider (single-line change) to cap concurrent DB access. Without it, the retry mechanism can amplify load during sustained outages. A proof-of-concept test demonstrating this is available at server/src/test/.../ThrottledCheckpointServiceDemoTests.scala in this branch.
I think that the agentic reviewer is a bit biased because I put all of these ideas already into the ticket. I actually had semaphore already implemented, but later I found out that the DB is actually not under the pressure at all (I checked the cluster's commit latency, read/write throughtput etc), so I removed it and simplified this PR. And certainly the server itself is also not under pressure, the only bottleneck seems to be the max DB connections that the Hikari pool uses, which I increased. It would be even better if I increase number of pods to 4 during peaks and back to 2 during normal operations, but this is not yet pattern our devops team works with on Kubernetes. Maybe later. Maybe the agent's side is enough for now, the server should fail faster and the agent should spread the load a bit more, and we have twice as much DB connections now: https://github.com/absa-group/BDE-k8s-infra/pull/399 |
Closes: #205
Release Notes: