The Stroom K8s Operator is a Kubernetes operator written in Go and developed using the Operator SDK.
Its purpose is to simplify the deployment and operational management of a Stroom cluster in a Kubernetes environment.
This project is not related to stroom-kubernetes, which is a Helm chart for deploying a Stroom stack, including optional components like Kafka. This Operator focuses solely on deploying and automating Stroom itself.
- Custom Resource Definitions (CRDs) for defining the desired state of a Stroom cluster, nodes and database
- Ability to designate dedicated `Processing` and `Frontend` nodes and route event traffic appropriately
- Automatic secrets management (e.g. secure database credential generation and storage)
- Simple deployment via Helm charts
- Scheduled database backups
- Stroom node audit log shipping
- Stroom node lifecycle management
- Prevent node shutdown while Stroom processing tasks are still active
- Automatic task draining during shutdown
- Rolling Stroom version upgrades
- Automatically scale the maximum tasks for each Stroom node by continually assessing average CPU usage.
The following parameters are configurable:
- Adjustment time interval (how often adjustments should be made)
- Metric sliding window (calculate the average based on the specified number of minutes)
- Minimum CPU % to keep the node above
- Maximum CPU % to keep the node below
- Minimum number of tasks allowed for the node
- Maximum number of tasks allowed for the node
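To illustrate how these parameters fit together, the fragment below is a hypothetical `StroomTaskAutoscaler` spec. The field names are illustrative assumptions only, not the authoritative CRD schema; consult the sample `autoscaler.yaml` for the real property names.

```yaml
# Hypothetical fragment of a StroomTaskAutoscaler spec; field names are
# illustrative assumptions, not the actual CRD schema.
spec:
  adjustmentIntervalMins: 1      # how often task limit adjustments are made
  metricsSlidingWindowMins: 5    # average CPU is calculated over this many minutes
  minCpuPercent: 50              # keep the node's CPU above this percentage
  maxCpuPercent: 90              # keep the node's CPU below this percentage
  minTaskLimit: 1                # minimum number of tasks allowed for the node
  maxTaskLimit: 20               # maximum number of tasks allowed for the node
```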
If you are just looking to install the Operator and don't wish to make any changes, you can skip this section.
This project was built with the Operator SDK, which bundles Kubernetes resource manifests (such as CRDs) and custom code into a deployable format.
- Install Operator SDK and additional prerequisites
- Clone this repository
- Run `make build-offline-bundle`, optionally specifying `PRIVATE_REGISTRY=my-registry.example.com`
- Kubernetes cluster running version >= v1.20
- Helm >= v3.8.0
- metrics-server (pre-installed with some K8s distributions)
- Pull the Helm charts for offline use. The following commands each produce a `.tar.gz` file containing the latest version of a Stroom operator Helm chart:

  ```shell
  helm pull oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds
  helm pull oci://ghcr.io/gradata-systems/helm-charts/stroom-operator
  ```
- Pull all images in the offline image list locally. The following script saves all images listed in `images.txt` to a file `images.tar.gz` in the current directory:

  ```shell
  images=$(curl -s 'https://raw.githubusercontent.com/gradata-systems/stroom-k8s-operator/master/deploy/images.txt') && \
  printf %s "$images" | \
  while IFS= read -r line; do \
    docker pull "$line"; \
  done && \
  docker image save --output=images.tar.gz $images
  ```
- Transport all downloaded archives to the airgapped environment.
- Push all container images to a private registry.
The operator requires two Helm charts to be installed in order to function. The `stroom-operator-crds` chart deploys the Custom Resource Definitions (CRDs), which define the structure of the custom Stroom cluster resources. The `stroom-operator` chart deploys the operator itself.
Install both charts:

```shell
helm install -n stroom-operator-system --create-namespace stroom-operator-crds \
  oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds

helm install -n stroom-operator-system stroom-operator \
  oci://ghcr.io/gradata-systems/helm-charts/stroom-operator
```

The operator will be deployed to the namespace `stroom-operator-system`. You can monitor its progress by watching the Pod named `stroom-operator-controller-manager`. Once it reaches the `Ready` state, you can deploy a Stroom cluster.
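For example, you can watch the Pods in that namespace with standard `kubectl`:

```shell
kubectl get pods -n stroom-operator-system --watch
```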
If your container images were pushed to a private registry (for instance in an air-gapped environment), specify the registry when installing the operator chart:

```shell
helm install -n stroom-operator-system stroom-operator \
  oci://ghcr.io/gradata-systems/helm-charts/stroom-operator \
  --set registry=<private registry URL>
```

An example Stroom cluster configuration is at `./samples`, which has the following features:
- Dedicated UI node for handling user web front-end traffic. The Stroom K8s Operator disables data processing for such nodes.
- Three dedicated data processing nodes. Only these nodes receive and process event traffic.
- Persistent storage for all nodes.
- Automatic task scaling for processing nodes, which aims to achieve optimal CPU utilisation during periods of high load.
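As a purely illustrative sketch of that layout (the field names here are assumptions, not the actual CRD schema; refer to the manifests in `./samples` for the authoritative structure), the node sets might be declared along these lines:

```yaml
# Hypothetical fragment: field names are illustrative assumptions only.
spec:
  nodeSets:
    - name: frontend
      count: 1          # dedicated UI node; the Operator disables data processing on it
      role: Frontend
    - name: data
      count: 3          # dedicated processing nodes; only these receive event traffic
      role: Processing
```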
- Create a `PersistentVolume` for each Stroom node (a sketch is shown after this list)
- Create a `DatabaseServer` resource (example: `database-server.yaml`)
- Create a `StroomCluster` resource (example: `stroom-cluster.yaml`)
- (Optional) Create a `StroomTaskAutoscaler` resource (example: `autoscaler.yaml`)
- Deploy each resource:

  ```shell
  kubectl apply -f database-server.yaml
  kubectl apply -f stroom-cluster.yaml
  kubectl apply -f autoscaler.yaml
  ```
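A minimal sketch of a `PersistentVolume` for a single Stroom node, assuming node-local storage; the name, capacity, storage class, path and hostname are illustrative and should be adapted to your environment (and to whatever the `StroomCluster` volume claims request):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: stroom-node-data-0          # illustrative name
spec:
  capacity:
    storage: 100Gi                  # size to suit your expected data volume
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage   # must match the storage class requested by the cluster's PVCs
  local:
    path: /data/stroom              # directory on the worker node
  nodeAffinity:                     # required for `local` volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1          # the node hosting this volume
```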
To upgrade the operator and its CRDs to a newer version, upgrade both Helm charts:

```shell
helm upgrade -n stroom-operator-system stroom-operator \
  oci://ghcr.io/gradata-systems/helm-charts/stroom-operator

helm upgrade -n stroom-operator-system stroom-operator-crds \
  oci://ghcr.io/gradata-systems/helm-charts/stroom-operator-crds
```

This upgrades the controller in place, without affecting any deployed Stroom clusters.
To upgrade a Stroom cluster to use a newer, tagged container image:
- Edit the `StroomCluster` resource manifest (e.g. `stroom-cluster.yaml`), replacing the property `spec.image.tag` with the new value, then apply it:

  ```shell
  kubectl apply -f stroom-cluster.yaml
  ```

- Watch the status of the `StroomCluster` Pods as the Stroom K8s Operator executes a rolling upgrade of each of them. The Operator will drain each Stroom node of any processing tasks before restarting it.
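For reference, the relevant fragment of the manifest is the nested `spec.image.tag` property (the tag value below is only an example):

```yaml
spec:
  image:
    tag: v7.2.0   # replace with the desired Stroom release tag
```

You can then follow the rolling upgrade with `kubectl get pods -n <namespace> --watch`.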
The operator can be safely removed without impacting any operational Stroom clusters. Bear in mind, however, that features such as task autoscaling will not work without the operator running.
```shell
helm uninstall -n stroom-operator-system stroom-operator
```

Removing the CRDs, on the other hand, should only be performed once all Stroom clusters (`StroomCluster` resources) have been deleted. This ensures that any Stroom processing tasks have had a chance to complete.
WARNING: Removing CRDs will in turn delete ALL Stroom clusters! If Stroom cluster persistent storage was configured correctly, deleting the CRDs will not result in data loss, as the PersistentVolumeClaims will remain bound.
```shell
helm uninstall -n stroom-operator-system stroom-operator-crds
```

To delete a Stroom cluster, delete the resources that were applied earlier:

```shell
kubectl delete -f stroom-cluster.yaml
kubectl delete -f database-server.yaml
```

The order of deletion does not matter, as the `DatabaseServer` resource deletion will only be finalised once the parent `StroomCluster` is removed.
If `kubectl` blocks for a period of time after you issue the above commands, this is normal: the `StroomCluster` may be draining tasks.
After deleting a cluster, depending on the `StroomCluster` property `spec.volumeClaimDeletePolicy`, one of the following will happen:
- (Not defined) - This is the safest option; the `PersistentVolumeClaim` created for each Stroom node remains. This means the `StroomCluster` may be re-deployed and each `Pod` will assume the same PVC it was allocated previously.
- `DeleteOnScaledownOnly` - PVCs are deleted only when the number of nodes in a `NodeSet` is reduced.
- `DeleteOnScaledownAndClusterDeletion` - PVCs are deleted if the `StroomCluster` is deleted. Be careful with this setting, as it requires intervention afterwards to unbind `PersistentVolume`s that were previously claimed.
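For reference, this policy is set on the `StroomCluster` manifest; a minimal fragment (other fields omitted) might look like:

```yaml
spec:
  # Omit this property to keep PVCs (the safest option),
  # or set one of: DeleteOnScaledownOnly, DeleteOnScaledownAndClusterDeletion
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
```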
A Stroom cluster may be re-deployed by re-applying the StroomCluster resource.
If a Stroom node becomes non-responsive, it may be necessary to restart its Pod. The example below deletes the first Stroom data node (index `0`) in the `StroomCluster` named `dev`:
```shell
kubectl delete pod -n <namespace> stroom-dev-node-data-0
```

As with deleting a `StroomCluster` resource, the Stroom K8s Operator will ensure the Pod is drained of all currently processing tasks before allowing it to shut down.
You can follow the `stroom-operator-controller-manager` Pod log to observe controller output and, in particular, what actions it is performing with regard to Stroom cluster state.
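For example, using standard `kubectl` (look up the full Pod name first, as it carries a generated suffix):

```shell
kubectl get pods -n stroom-operator-system
kubectl logs -n stroom-operator-system -f <stroom-operator-controller-manager Pod name>
```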
- Use a version control system like Git to manage cluster configurations.
- Back up the database secrets generated by the Stroom K8s Operator. These are stored in a `Secret` resource in the same namespace as the `StroomCluster`, named using the convention `stroom-<cluster name>-db`. The credentials for the users `root` and `stroomuser` are contained within, and deletion of this `Secret` will cause the Stroom cluster to stop functioning! (An example backup command is shown after this list.)
- Ensure the `StroomCluster` property `spec.nodeTerminationGracePeriodSecs` is set to a sufficiently large value. If your Stroom nodes typically have long-running tasks, ensure the value of this property is larger than the longest task. This will give Stroom nodes enough time to finish processing tasks before fulfilling a shutdown request. If the interval is too short, any tasks still processing will fail. Conversely, setting it too long will cause non-responsive Stroom nodes to linger for extended periods before being killed.
- Experiment with different `StroomTaskAutoscaler` parameters. A tighter CPU percentage min/max range is probably preferable, as this will make the Operator work harder to keep CPU usage in range. Bear in mind that the CPU percentages are based on a rolling average, so set a realistic upper task limit to ensure momentary heavy load doesn't overwhelm the node.
- In particularly large deployments (i.e. involving many Stroom nodes), it may be necessary to increase the resources allocated to the `stroom-operator-controller-manager` Pod. This can be done by editing `all-in-one.yaml` prior to deployment. The need for more resources is due to the Operator maintaining a finite collection of `StroomCluster` `Pod` metrics in memory.
- `DatabaseServer` backups are performed as a single transaction. As this can cause issues with concurrent schema changes, Stroom upgrades (which sometimes modify the DB schema) should not be performed while a database backup is in progress.
- If a Stroom `Pod` hangs and you do not want to wait for it to be deleted (and are comfortable accepting the risk of losing processing tasks), you can force its deletion by:
  - Deleting the `Pod` (e.g. using `kubectl`)
  - Terminating the Stroom Java process within the running container (named `stroom-node`)
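As a simple way to back up the database `Secret` mentioned above, you can export it with standard `kubectl` (namespace and cluster name are placeholders) and store the output somewhere safe:

```shell
kubectl get secret -n <namespace> stroom-<cluster name>-db -o yaml > stroom-db-credentials-backup.yaml
```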