@jsmolar jsmolar commented Sep 30, 2025

Fixes #8806

Remove dependency on the persistent HyTrust KMIP server

This change replaces the persistent HyTrust KMIP server with a per-job PyKMIP server deployed on the monitoring node. This setup is similar to Scylla Manager, which is also deployed there.

Additionally, this commit adds a wait step during Scylla startup in BaseScyllaCluster, since Scylla requires the KMIP server to be running before it can start.
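A wait step of this kind could look roughly like the sketch below. It is not the code from this PR: the helper name `wait_for_kmip`, the timeout, and the polling interval are assumptions for illustration. 5696 is the IANA-registered KMIP port, but the port actually used by this deployment is not stated here.

```python
import socket
import time


def wait_for_kmip(host: str, port: int = 5696, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires.

    Hypothetical helper: a successful TCP connect only shows that something is
    listening on the port, not that the KMIP server is ready to serve requests.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or host unreachable: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Scylla would then be started only after `wait_for_kmip(...)` returns `True`; a `False` result should fail the setup early instead of letting Scylla start without its key provider.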

Because the upstream PyKMIP repository is no longer maintained, I had to make some modifications in a fork: https://github.com/jsmolar/pykmip. They include adding PostgreSQL support and several minor adjustments.

On the monitoring node, PyKMIP is deployed via docker-compose, which launches 10 PyKMIP instances behind a single HAProxy load balancer, all backed by PostgreSQL.

Because a single PyKMIP instance is not designed for enterprise-scale load, running several instances behind a load balancer provides better scalability and reliability under high demand. The change also eliminates the need for an expensive license and a permanently running HyTrust KMIP server.
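As a rough illustration of the deployment step, the sketch below builds the shell commands with the instance count as a parameter instead of a fixed value. The function name `pykmip_compose_commands` and its parameters are hypothetical; the `pykmip` service name and the `docker compose up -d --scale` invocation are taken from the diff in this PR.

```python
import shlex


def pykmip_compose_commands(workdir: str, compose_url: str, instances: int = 10) -> list[str]:
    """Return the shell commands that fetch the compose file and start PyKMIP.

    Hypothetical helper: it only builds the command strings, so the caller
    decides where to run them (for example through a remoter object).
    """
    if instances < 1:
        raise ValueError("at least one PyKMIP instance is required")
    quoted_dir = shlex.quote(workdir)
    return [
        # Download the compose file into the working directory.
        f"cd {quoted_dir} && curl -O {shlex.quote(compose_url)}",
        # Start the stack in the background with the requested number of PyKMIP instances.
        f"cd {quoted_dir} && docker compose up -d --scale pykmip={instances}",
    ]
```

Keeping the count as a parameter makes it easy to run a single instance for small tests and scale up only where the load requires it.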

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)


jsmolar commented Sep 30, 2025

Please note that this is just a proposal.

  • All certificates will be moved to S3 once we decide to proceed with this implementation.
  • I created a new snapshot for the backup and restore nemesis, as I encountered issues when restoring from the 2024 snapshots (an issue for this should already exist).
  • MgmtRestore nemesis - I had to disable this nemesis in testing, since it was created with KMS encryption, which cannot be used with clusters running KMIP encryption (can you confirm if this is correct?).
  • A new PyKMIP image and docker-compose.yaml are currently in my Docker Hub and Git repo, respectively. This is not the correct long-term approach, but I need guidance on where these should be placed.

@jsmolar jsmolar marked this pull request as draft October 1, 2025 10:21

jsmolar commented Oct 1, 2025

@scylladb/qa-maintainers could you please check if this solution looks correct and share your opinion on it?

Why is this file being added to the source?

self.log.debug("Using round_robin for multiple Keyspaces...")
for i in range(1, keyspace_num + 1):
    keyspace_name = self._get_keyspace_name(i)
    keyspace_name = "10gb_sizetiered_2025_3"

Is this for testing only, I guess? Please remove it (or put it in its own commit).

@raise_event_on_failure
def node_startup(_node: BaseNode, task_queue: queue.Queue):
    exception_details = None
    node.remoter

???

node.remoter.run("cd /home/ubuntu/pykmip && "
                 "curl -O https://raw.githubusercontent.com/jsmolar/PyKMIP/master/docker-compose.yaml")
node.remoter.sudo("chmod 777 /home/ubuntu/pykmip/data/logs")
node.remoter.run("cd /home/ubuntu/pykmip && docker compose up -d --scale pykmip=10")

10? Do we really need that many instances?

self.install_scylla_manager(node)
self.install_pykmip(node)

def install_pykmip(self, node):

This one doesn't just install it, it also runs it.

I'm not sure it is wise to set it up on the monitor node; I would suggest putting it on the test node (i.e. the SCT runner).

We already have one example of a docker-compose-based setup running (kafka-stack-docker-compose).
I would suggest aligning it with that implementation.

@fruch fruch left a comment

  • Move https://github.com/jsmolar/pykmip into the Scylla org.
  • Move the installation onto the sct-runner.
  • Why do we need HA for it? We are not going to use this setup for multiple tests at the same time, are we?
  • Move the test-related changes out of the PR.


# 1GB dataset
- prepare_write_cmd: ["cassandra-stress write cl=ONE n=1048576 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..1048576"]
+ prepare_write_cmd: ["cassandra-stress write cl=ONE n=1048576 -schema 'keyspace=10gb_sizetiered_2025_3 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=LeveledCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..1048576"]

Please avoid hardcoding the keyspace name in a test-case YAML that is supposed to be used across several tests.
Also, there are some issues with the current implementation:

  • The C-S operation will write only 1 GB of data, not 10 GB as the keyspace name suggests.
  • A consecutive C-S read would fail, as it doesn't expect the 10gb_sizetiered_2025_3 keyspace.

If your idea was to prepare a backup snapshot for later use, please take a look at the test-case YAMLs here: https://github.com/scylladb/scylla-cluster-tests/tree/master/test-cases/manager/prepare_snapshot
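The 1 GB figure can be checked with a quick calculation. The sketch below is only an estimate of the raw payload (rows × column size × column count), ignoring replication and storage overhead; the helper name `cs_write_size_bytes` is hypothetical and the command is abbreviated from the YAML above.

```python
import re


def cs_write_size_bytes(cmd: str) -> int:
    """Estimate the raw payload written by a cassandra-stress write command."""
    rows = int(re.search(r"\bn=(\d+)", cmd).group(1))                # n=1048576
    col_size = int(re.search(r"size=FIXED\((\d+)\)", cmd).group(1))  # size=FIXED(1024)
    col_count = int(re.search(r"\bn=FIXED\((\d+)\)", cmd).group(1))  # n=FIXED(1)
    return rows * col_size * col_count


cmd = "cassandra-stress write cl=ONE n=1048576 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..1048576"
print(cs_write_size_bytes(cmd) / 2**30)  # 1.0 (GiB)
```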


fruch commented Oct 28, 2025

> Please note that this is just a proposal.
>
>   • All certificates will be moved to S3 once we decide to proceed with this implementation.

Why static keys at all? Can we generate a fresh key on each run, like we do for client encryption?
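As a minimal sketch of the per-run idea, the snippet below builds an `openssl req` command that creates a throwaway self-signed certificate and key. The function names, file names, and the one-day validity are assumptions for illustration, and this is not how SCT generates its client-encryption certificates. A real KMIP setup would more likely need a CA plus separate server and client certificates.

```python
import subprocess
from pathlib import Path


def openssl_selfsigned_argv(out_dir: str, common_name: str, days: int = 1) -> list[str]:
    """Build the openssl command line for a short-lived self-signed certificate."""
    out = Path(out_dir)
    return [
        "openssl", "req", "-x509",
        "-newkey", "rsa:2048",           # generate a new RSA key together with the certificate
        "-nodes",                        # do not encrypt the private key with a passphrase
        "-keyout", str(out / "key.pem"),
        "-out", str(out / "cert.pem"),
        "-days", str(days),
        "-subj", f"/CN={common_name}",
    ]


def generate_selfsigned_cert(out_dir: str, common_name: str, days: int = 1) -> None:
    """Run openssl (assumed to be installed) to create key.pem and cert.pem in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(openssl_selfsigned_argv(out_dir, common_name, days), check=True)
```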

>   • I created a new snapshot for the backup and restore nemesis, as I encountered issues when restoring from the 2024 snapshots (an issue for this should already exist).
>   • MgmtRestore nemesis - I had to disable this nemesis in testing, since it was created with KMS encryption, which cannot be used with clusters running KMIP encryption (can you confirm if this is correct?).
>   • A new PyKMIP image and docker-compose.yaml are currently in my Docker Hub and Git repo, respectively. This is not the correct long-term approach, but I need guidance on where these should be placed.

Development

Successfully merging this pull request may close these issues.

Enterprise KMIP tests are not working

3 participants