Skip to content

#205: adding retry mechanism for Atum agent on HTTP dispatcher level#443

Merged
lsulak merged 8 commits into
masterfrom
feature/205-db-connection-bottlenecks-improvements
Jun 1, 2026
Merged

#205: adding retry mechanism for Atum agent on HTTP dispatcher level#443
lsulak merged 8 commits into
masterfrom
feature/205-db-connection-bottlenecks-improvements

Conversation

@lsulak

@lsulak lsulak commented May 22, 2026

Copy link
Copy Markdown
Collaborator

Closes: #205

Release Notes:

  • Atum Service: adding DB connection timeout parameter for being able to tune this on the server side
  • Atum Agent: adding a simple retry mechanism onto the agent's HTTP layer and also changeable settings for timeouts on the HTTP layer (for the underlying OkHTTP client)

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

Report: Report: server - scala:2.13.13

Metric (instruction) Coverage Threshold Status
Overall 78.06% 71.0%
Changed Files 0.0% 80.0%
File Path Coverage Threshold Status
PostgresConfig.scala 0.0% 0.0%
TransactorProvider.scala 0.0% 0.0%

@lsulak lsulak changed the title #205: DB connection timeout during high-burst periods #205: adding retry mechanism for Atum agent on HTTP dispatcher level May 22, 2026
lsulak added 3 commits May 22, 2026 16:51
…necks-improvements' into feature/205-db-connection-bottlenecks-improvements
… checking idempontency of each operation etc
Comment on lines +30 to +31
#atum.dispatcher.http.timeout.read-ms=30000 # response read timeout in milliseconds (default: 30 000)
#atum.dispatcher.http.timeout.write-ms=10000 # request write timeout in milliseconds (default: 10 000)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am not sure, what exactly mean read and write timeout?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, thanks for the review & doing it so quickly even! This is used here:

    val okHttpClient = new OkHttpClient.Builder()
      .connectTimeout(connectTimeoutMs, TimeUnit.MILLISECONDS)
      .readTimeout(readTimeoutMs, TimeUnit.MILLISECONDS)
      .writeTimeout(writeTimeoutMs, TimeUnit.MILLISECONDS)
      .build()

so these values are used in the OkHTTP meaning that it's all on the HTTP layer: https://www.baeldung.com/okhttp-timeouts

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the source above:

ConnectionTimeout: It defines a time period in which our client should establish a connection with a target host.

ReadTimeout: It defines a maximum time of inactivity between two data packets when waiting for the server’s response.

WriteTimeout: it defines a maximum time of inactivity between two data packets when sending the request to the server.

#atum.dispatcher.capture.capture-limit=1000 # 0 means no limit

# Retry configuration for the HTTP dispatcher (all keys optional — defaults apply if absent)
#atum.dispatcher.http.retry.max-retries=3 # number of retries after first failure (0 = no retries)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal option would be, to have default on 1 - generally in a healthy, stable well load balanced environment, there should not be need for more - but not a hard complain.

@lsulak lsulak May 25, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tend to go more defensive on this one.

In healthy stable environment it might be that there are almost no retries or 1 is enough, but there were some production issues and I would like the agent to be generally / by default a bit more robust.

So, basically if there are 3 potential retries and it's not used in healthy environment, there is zero cost to have configured 3. But under the load during peaks / different environments this can be beneficial.

@salamonpavel salamonpavel left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #443 Review — Retry Mechanism & DB Connection Timeout

Summary

This PR introduces two welcome improvements:

  1. Agent-side HTTP retry with exponential backoff — transient 5xx responses and IOExceptions (connection resets, read timeouts) are now retried up to 3 times with jitter, instead of failing immediately.
  2. Configurable OkHTTP timeouts & HikariCP connectionTimeout — operators can now tune connect/read/write timeouts on the agent and connection-acquire timeout on the server without code changes.

Both changes are well-implemented: config keys are optional with sensible defaults, the 409-idempotency handling for saveCheckpoint is a nice touch, and the test coverage is solid.


Why client-side retries alone are not enough

While retries improve resilience against transient failures (brief network blips, single-request timeouts), they can amplify sustained overload:

Scenario Without retries With retries (×3)
100 agents hit a struggling server 100 requests Up to 400 requests
Server returns 503 under load Agents back off (by failing) Agents retry 3× — more load on an already-overloaded server
DB connection pool exhausted Requests fail at HikariCP timeout Retried requests queue up again — pool stays saturated longer

The fundamental issue: the server currently has zero admission control. It accepts unlimited concurrent requests, all of which compete for 10 HikariCP connections. Client-side retries help survive momentary blips but worsen sustained pressure.

Lowering connectionTimeout from 30s → 10s is helpful (fail-fast), but it only changes when the failure happens — it doesn't prevent the overload itself.


Recommended server-side complement

The most effective protection is a concurrency semaphore at the transactor level — a single gate that limits how many queries can be in-flight across the entire application, matching the DB connection pool size.

Why the transactor level?

All database calls in the server flow through a single HikariTransactor[Task] provided by TransactorProvider. Wrapping it with a ZIO Semaphore means:

  • One change, all services protected — no need to modify every controller or service class.
  • Cumulative limit — all endpoints share the same permit pool.
  • Clean backpressure — excess fibers suspend on the semaphore instead of queuing inside HikariCP, giving you the option to fail fast with a timeout.

How it works

// In TransactorProvider.layer — after creating the HikariTransactor:
import zio.Semaphore
import cats.~>

for {
  // ... existing setup ...
  xa  <- HikariTransactor.fromHikariConfig[Task](hikariConfig, ...).toScopedZIO

  // Wrap: at most maxPoolSize queries run concurrently
  sem <- Semaphore.make(postgresConfig.maxPoolSize)
  throttledXa = xa.mapK(new ~>[Task, Task] {
    def apply[A](fa: Task[A]): Task[A] = sem.withPermit(fa)
  })
} yield throttledXa

Optional: fail-fast under sustained load

Instead of queueing indefinitely, reject with a clear error when the server is saturated:

sem.withPermit(fa)
   .timeoutFail(GeneralDatabaseError("server overloaded — try again later"))(5.seconds)

This turns into an HTTP 503 at the controller level, which the agent's new retry mechanism will handle gracefully — creating a healthy feedback loop:

Server overloaded → 503 → Agent backs off with jitter → Server recovers → Next retry succeeds

Relevant ZIO documentation

  • Semaphore — the core concurrency primitive used here.
  • ZIO.timeout / timeoutFail — for fail-fast behavior.
  • Ref — useful if you want to track/expose current concurrency as a metric.

How client + server work together

Agent (this PR)                          Server (proposed complement)
─────────────────                        ────────────────────────────
Request fails (5xx/IOException)          Too many in-flight queries
        │                                        │
        ▼                                        ▼
Retry with exponential backoff           Semaphore rejects with 503
  + jitter (spread retries)                (fail-fast, not queue)
        │                                        │
        ▼                                        ▼
Server has time to recover               Pool connections free up
        │                                        │
        └──────── Next retry succeeds ───────────┘

The two mechanisms are complementary:

  • The server protects itself (sheds load it can't handle).
  • The agent handles the resulting 503s gracefully (retries with backoff).
  • Together they create a stable feedback loop instead of a cascading failure.

Bottom line

This PR is a solid step — merge-worthy on its own merits.

🔧 Follow-up recommendation: add a server-side Semaphore in TransactorProvider (single-line change) to cap concurrent DB access. Without it, the retry mechanism can amplify load during sustained outages. A proof-of-concept test demonstrating this is available at server/src/test/.../ThrottledCheckpointServiceDemoTests.scala in this branch.

@lsulak

lsulak commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

Bottom line

This PR is a solid step — merge-worthy on its own merits.

🔧 Follow-up recommendation: add a server-side Semaphore in TransactorProvider (single-line change) to cap concurrent DB access. Without it, the retry mechanism can amplify load during sustained outages. A proof-of-concept test demonstrating this is available at server/src/test/.../ThrottledCheckpointServiceDemoTests.scala in this branch.

I think that the agentic reviewer is a bit biased because I put all of these ideas already into the ticket. I actually had semaphore already implemented, but later I found out that the DB is actually not under the pressure at all (I checked the cluster's commit latency, read/write throughtput etc), so I removed it and simplified this PR.

And certainly the server itself is also not under pressure, the only bottleneck seems to be the max DB connections that the Hikari pool uses, which I increased. It would be even better if I increase number of pods to 4 during peaks and back to 2 during normal operations, but this is not yet pattern our devops team works with on Kubernetes. Maybe later.

Maybe the agent's side is enough for now, the server should fail faster and the agent should spread the load a bit more, and we have twice as much DB connections now: https://github.com/absa-group/BDE-k8s-infra/pull/399

@lsulak lsulak merged commit 87caa84 into master Jun 1, 2026
6 of 8 checks passed
@lsulak lsulak deleted the feature/205-db-connection-bottlenecks-improvements branch June 1, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Release process is broken

3 participants