Rename "write checkpoints" -> "checkpoint requests" to make terminology more clear.
Add a /sync/checkpoint-request endpoint, to fix some issues currently experienced.
Status
Update 2025-08-14: Rename "write checkpoints" to "checkpoint requests"
Update 2025-08-06: Added automatic re-posting of unacknowledged checkpoint requests every 10 seconds, which allows automatic recovery from most failure scenarios.
TODO:
Spec out client-side storage.
Check how this affects the Rust client implementation.
Make a decision on backwards-compatibility approach.
Think about a migration path for disabling old-style checkpoint requests, which currently never expire automatically.
Background
Write checkpoints are documented here. In summary, the idea is to give the client a way to wait for all uploaded data to be acknowledged and synced back, before applying the next checkpoint locally.
More generally, a write checkpoint, or "checkpoint request" in the proposed terminology, is a way to request a checkpoint that matches the server-side database state at the time of the request, or a later point.
It currently works by making a request to /write-checkpoint2.json, which returns an auto-incrementing id scoped to the (user_id, client_id) combination. When the sync stream syncs back the same write checkpoint id (or greater), the client knows the changes have been acknowledged, and can apply the local changes.
The Team and Enterprise versions of the PowerSync Service also support "custom write checkpoints" for use cases where uploaded data is written to the database asynchronously, for example via a message queue. Instead of calling /write-checkpoint2.json, the client can call a custom backend API to generate a new write checkpoint id, and persist it locally using transaction.complete(writeCheckpointId). On the server, this checkpoint is asynchronously written to the source database, at which point it is picked up by the PowerSync Service and synced to the client.
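For illustration, a minimal sketch of this custom flow in a TypeScript connector follows. The /api/upload endpoint is a hypothetical backend API; getNextCrudTransaction() and transaction.complete() are the existing client SDK calls.

import { AbstractPowerSyncDatabase } from '@powersync/web';

// Hypothetical backend endpoint: applies the uploaded batch, generates a new
// write checkpoint id, and asynchronously persists it to the source database.
const UPLOAD_URL = '/api/upload';

export async function uploadData(database: AbstractPowerSyncDatabase): Promise<void> {
  const transaction = await database.getNextCrudTransaction();
  if (transaction == null) {
    return;
  }

  const response = await fetch(UPLOAD_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ batch: transaction.crud })
  });
  const { writeCheckpointId } = await response.json();

  // Persist the checkpoint id locally. The client now waits for the sync
  // stream to deliver a checkpoint with this id (or greater).
  await transaction.complete(writeCheckpointId);
}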
Process flow
To simplify the diagram, we show a direct connection between Client and SourceDB for writing changes and creating write checkpoints. In practice, this is done either via the PowerSync Service or a custom backend API, but the sequence remains the same.
Current issues
The checkpoint is scoped to (user_id, client_id) on the service, but effectively only to (client_id) on the client. So if the user_id changes on the client, a pending write checkpoint may never be synced back until a new one is eventually generated after uploading more data. In some cases, this could leave the client "stuck" syncing for a long time.
When switching between default write checkpoints and custom write checkpoints, there could similarly be a mismatch in checkpoints, causing the client to be stuck.
The current client-side implementation of custom write checkpoints requires each transaction to be completed with a write checkpoint id, unlike the default implementation, which only generates one checkpoint after all transactions have been uploaded.
If an app has many anonymous/temporary users, or regularly creates new temporary databases with unique client ids, it may end up with many write checkpoints on the service. We can never clean these up, since we don't know whether a client would ever connect again and need that write checkpoint. While each individual write checkpoint is small, this can add up over time when you have hundreds of thousands of unique users/clients.
The name "write checkpoints" does not capture its purpose accurately. While waiting for client writes to be acknowledged is the original use case, it is not the only use case. And the relationship between "checkpoints" and "write checkpoints" is confusing.
Proposal
1. Rename "write checkpoints" -> "checkpoint requests"
The new name better describes the relationship between the two concepts: "checkpoint requests" are not a standalone concept, as "write checkpoints" made them appear. Instead, a checkpoint request asks the service to provide a checkpoint that contains the source database state at the time the request is made (or later).
Terminology:
Checkpoint: Identifies a specific snapshot of state, as replicated from the source database to the PowerSync Service. This may include a checkpoint request id, if the request is fulfilled by this checkpoint. (unchanged)
Checkpoint Id: The op_id associated with the checkpoint - a strictly incrementing sequence, global per PowerSync instance. (unchanged)
Checkpoint Request, or "request" for short: A request for a new checkpoint, effectively "give me a checkpoint matching the current database state or later". Previously "write checkpoint".
Checkpoint Request Id: A client-generated id identifying a checkpoint request, auto-incrementing on the client side. Previously "write checkpoint id".
Requested checkpoint: A synced checkpoint with a request id.
2. Client-generated request id
Let the client generate the checkpoint request number, instead of the service.
3. Create a new checkpoint request API
POST /sync/checkpoint-request
{
client_id: string
checkpoint_request_id: number
}
This always replaces the checkpoint request for this (user_id, client_id). If a request with the same id already exists, this is a no-op. In any other scenario, this replaces the request with a new LSN/replication head. The checkpoint request may auto-expire after a set period, say 1 day.
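A sketch of the intended server-side semantics follows. requestStore and getCurrentReplicationHead are hypothetical stand-ins for the service's actual storage and replication internals.

interface CheckpointRequest {
  userId: string;
  clientId: string;
  requestId: number;
  lsn: string;      // replication head at the time of the request
  expiresAt: Date;  // e.g. now + 1 day
}

// Hypothetical stand-ins for the service's storage and replication internals.
declare const requestStore: {
  get(userId: string, clientId: string): Promise<CheckpointRequest | null>;
  put(request: CheckpointRequest): Promise<void>;
};
declare function getCurrentReplicationHead(): Promise<string>;

export async function handleCheckpointRequest(
  userId: string,
  clientId: string,
  requestId: number
): Promise<void> {
  const existing = await requestStore.get(userId, clientId);
  if (existing?.requestId === requestId) {
    // Same id already recorded: no-op, which makes retries idempotent.
    return;
  }
  // Any other scenario: replace the request with a new LSN/replication head.
  await requestStore.put({
    userId,
    clientId,
    requestId,
    lsn: await getCurrentReplicationHead(),
    expiresAt: new Date(Date.now() + 24 * 60 * 60 * 1000) // auto-expire after 1 day
  });
}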
Syncing the checkpoint request id (previously write checkpoint id) back in the sync stream works the same as before.
This has some advantages:
The client does not have to wait for the response of the HTTP request.
The call is idempotent - the client can retry this call any number of times, and it would only have an effect if the server hasn't seen it yet.
These combine into the main goal: the client can arbitrarily call the same API again, for example every time it creates a new connection, without adding significant overhead. This covers cases where the user id changes, or where the service expires the request.
A general heuristic could be: whenever the client has been waiting for a requested checkpoint for more than 10 seconds (plus some random jitter), re-request it (see the sketch below). The timer resets whenever a new request is created or when a new requested checkpoint is received, but not when a new normal checkpoint is received. The specifics here may change to make the client easier to implement or to tweak the timeout, as long as the general concept holds. This has the added advantage that it would become quite clear in the service logs when a client is waiting for a requested checkpoint for an extended time.
The 10 second delay here is a trade-off between:
Not adding too much overhead if there is significant replication lag on the service, in which case all clients may repeatedly re-post the requests.
Keeping the delay short enough to recover from potential error scenarios in a reasonable time.
We add some jitter to the 10-second delay to spread out requests in the case of high load.
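A sketch of this heuristic on the client follows. postCheckpointRequest is a hypothetical wrapper around the new endpoint, and the 2-second jitter range is an assumption.

// Hypothetical wrapper around POST /sync/checkpoint-request.
declare function postCheckpointRequest(body: {
  client_id: string;
  checkpoint_request_id: number;
}): Promise<void>;

export class CheckpointRequestWatchdog {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private clientId: string,
    private requestId: number
  ) {}

  // Call when a new request is created, or when a new requested checkpoint is
  // received. Plain checkpoints do not reset the timer.
  reset(): void {
    this.stop();
    // 10 seconds, plus up to 2 seconds of random jitter to spread out
    // re-posts under high load.
    const delay = 10_000 + Math.random() * 2_000;
    this.timer = setTimeout(() => void this.repost(), delay);
  }

  // Call when the awaited requested checkpoint arrives.
  stop(): void {
    if (this.timer != null) {
      clearTimeout(this.timer);
    }
  }

  private async repost(): Promise<void> {
    // Idempotent: re-posting the same (client_id, checkpoint_request_id) is a
    // no-op if the service already saw it, and recovers the request otherwise.
    await postCheckpointRequest({
      client_id: this.clientId,
      checkpoint_request_id: this.requestId
    });
    this.reset();
  }
}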
Custom checkpoint requests
For custom checkpoint requests, we don't need the new service API (since it's bypassed), but could implement the same workflow:
Client generates the checkpoint request id, instead of the backend.
Client passes in the checkpoint request id to the backend.
The rest works the same.
To support implementing this, the connector gains a new optional method:
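(The original snippet is not preserved in this capture; the name and signature below are hypothetical, mirroring the fields of the new API.)

// Hypothetical name and signature for the new optional connector method.
interface ConnectorCheckpointRequests {
  createCheckpointRequest?(
    clientId: string,
    checkpointRequestId: number
  ): Promise<void>;
}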
If this is present, it replaces the built-in checkpoint request call.
A remaining point to figure out: How do we handle APIs that want to implement the checkpoint request directly as part of the upload API, to avoid having another separate API call?
Backwards-compatibility
On the service, this is a new API call, and the existing /write-checkpoint2.json call and behavior will remain for clients not supporting the new API.
On the client, we'd need to either:
Make the feature opt-in.
Do feature-detection: Attempt the new API, falling back to the old approach if we get a 404.
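A minimal sketch of the feature-detection option, assuming hypothetical serviceUrl and legacyWriteCheckpoint helpers:

declare const serviceUrl: string;
// Hypothetical helper implementing the existing /write-checkpoint2.json flow.
declare function legacyWriteCheckpoint(clientId: string): Promise<void>;

export async function createCheckpointRequest(
  clientId: string,
  requestId: number
): Promise<void> {
  const response = await fetch(`${serviceUrl}/sync/checkpoint-request`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ client_id: clientId, checkpoint_request_id: requestId })
  });
  if (response.status === 404) {
    // Service without the new API: fall back to the old approach.
    await legacyWriteCheckpoint(clientId);
    return;
  }
  if (!response.ok) {
    throw new Error(`Checkpoint request failed: ${response.status}`);
  }
}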
Client-side storage
We could use this as an opportunity to clean up client-side storage: instead of storing checkpoint requests in ps_buckets (an old hack), we can store them in ps_kv or a new dedicated table.
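One possible shape for such a dedicated table; the table name and columns below are assumptions, not part of the proposal:

// Hypothetical schema for a dedicated client-side table.
const CREATE_CHECKPOINT_REQUESTS_TABLE = `
CREATE TABLE IF NOT EXISTS ps_checkpoint_requests (
  client_id TEXT NOT NULL PRIMARY KEY,
  request_id INTEGER NOT NULL,
  created_at TEXT NOT NULL
)`;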
Risks / Drawbacks
Since we're replacing the checkpoint request on the service instead of only keeping the largest one, we need to consider the possibility of race conditions. The following could theoretically happen in very weird network conditions:
Client posts checkpoint request 10, but the request times out on the client (network issue), or the client cancels it (disconnect + reconnect).
Client posts checkpoint request 11.
Server sees checkpoint request 11.
Server sees checkpoint request 10 after a delay in the network or load balancer, then marks this as the new active one.
Now, the server only syncs back the checkpoint for request 10, while the client waits for 11.
While this scenario seems quite unlikely, there is nothing guaranteeing that requests from the client arrive at the service in order, unless the client received an explicit response before sending the next one.
The strategy above of automatically re-posting a request every 10 seconds while the client waits for it means this only adds a 10-second delay. Since this should be a very rare case, the delay should be acceptable.
Alternatives/Variations Considered
Sync stream request
In addition to the separate API call, the sync stream could also take the current pending checkpoint request as an additional parameter. This would behave the same as making a new API call with that request, while avoiding the additional network call.
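For illustration, the stream request could carry the pending request alongside its existing parameters, in the same shape as the standalone endpoint (the field placement is an assumption):

POST /sync/stream
{
  client_id: string
  checkpoint_request_id: number
  ... existing sync stream parameters
}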
Keep server-generated ids
We could keep using server-generated ids instead of client-generated ids, using the current API. In that case, we'd need a way to validate that the checkpoint request the client is waiting for will eventually arrive, and only generate a new request if it is not found.
To validate the checkpoint request, we could either:
Let the server return "pending checkpoint requests" in the sync stream. If the client connects and the current checkpoint request does not appear there, it can create a new one.
Let the client pass in the current pending checkpoint request in the sync stream request, and let the service do that validation.
A big limitation here is that this doesn't cater for custom checkpoint requests at all. It also feels more complex and hacked together overall.
Keep strictly incrementing checkpoint ids
Enforcing strictly incrementing checkpoint request ids on the service would completely avoid the potential consistency issue mentioned above. However, it would also limit recovery options if the checkpoint request id does ever decrease on the client. This could theoretically happen if:
The client switches between client-generated and server-generated checkpoint request ids.
The client database is partially reset, keeping the client id but resetting the checkpoint request sequence back to 0.
So overall, it removes one failure scenario (that we already have a proposal for recovering from automatically), while introducing others (that are more tricky to solve).