CrateDB version update in Kubernetes not fully automated?

Hi all,

We have a Kubernetes cluster running the crate-operator (version 2.45.0) and a CrateDB cluster with three nodes created from a CrateDB resource.

As a result, the cluster contains one pod for the crate-operator (created from the ReplicaSet of a Deployment) and three pods for the CrateDB cluster (created from the StatefulSet that the operator manages for the CrateDB resource).

The issue we are having is that when we update the version of the CrateDB cluster in the CrateDB resource, the crate-operator picks up the change and applies it to the StatefulSet, but the three CrateDB pods keep running the previous version.
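
For reference, this is roughly how we confirmed the mismatch, using the official kubernetes Python client (a minimal sketch; the resource names and the label selector are placeholders for our setup):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

# Image the crate-operator wrote into the StatefulSet (the new version)
sts = apps.read_namespaced_stateful_set("crate-data-hot-cluster", "control-plane")
print("StatefulSet image:", sts.spec.template.spec.containers[0].image)

# Images the pods are actually running (still the old version)
for pod in core.list_namespaced_pod("control-plane", label_selector="app=crate").items:
    print(pod.metadata.name, pod.spec.containers[0].image)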

According to the crate-operator log, the update finished successfully:

[2025-03-10 13:45:11,380] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.HEALTH.
[2025-03-10 13:45:11,380] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'ping_cratedb' succeeded.
[2025-03-10 13:45:15,956] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/disable_cronjob' succeeded.
[2025-03-10 13:45:16,076] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'enable_cronjob_after_delay/status.delay_cronjob' succeeded.
[2025-03-10 13:45:16,090] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/upgrade' failed temporarily: Waiting for 'cluster_update/before_cluster_update'.
[2025-03-10 13:45:16,227] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/restart' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade'.
[2025-03-10 13:45:16,367] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/restore_user_jwt_auth' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart'.
[2025-03-10 13:45:16,506] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/after_upgrade' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth'.
[2025-03-10 13:45:16,653] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth,cluster_update/after_upgrade'.
[2025-03-10 13:45:16,788] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/notify_success_update' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth,cluster_update/after_upgrade,cluster_update/after_cluster_update'.
[2025-03-10 13:45:17,020] kopf.objects         [INFO    ] [control-plane/cluster] Checking if there are running snapshots ...
[2025-03-10 13:45:17,078] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/ensure_no_snapshots_in_progress' succeeded.
[2025-03-10 13:45:17,227] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/ensure_no_cronjobs_running' succeeded.
[2025-03-10 13:45:17,497] kopf.objects         [INFO    ] [control-plane/cluster] Trying to set setting cluster.routing.allocation.enable to value new_primaries with mode PERSISTENT
[2025-03-10 13:45:17,546] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/set_cluster_routing_allocation_setting' succeeded.
[2025-03-10 13:45:17,547] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update' succeeded.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/upgrade' succeeded.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.UPGRADE.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Sending WebhookStatus.IN_PROGRESS notification event WebhookEvent.UPGRADE with payload {'old_registry': 'public.ecr.aws/docker/library/crate', 'new_registry': 'public.ecr.aws/docker/library/crate', 'old_version': '5.8.6', 'new_version': '5.9.8'}
[2025-03-10 13:45:46,459] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/restart' succeeded.
[2025-03-10 13:45:46,611] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/restore_user_jwt_auth' succeeded.
[2025-03-10 13:45:46,753] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_upgrade' succeeded.
[2025-03-10 13:45:46,954] kopf.objects         [INFO    ] [control-plane/cluster] Trying to reset setting cluster.routing.allocation.enable
[2025-03-10 13:45:46,997] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update' succeeded.
[2025-03-10 13:45:46,997] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update/reset_cluster_routing_allocation_setting' succeeded.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/notify_success_update' succeeded.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.UPGRADE.
[2025-03-10 13:45:47,140] kopf.objects         [INFO    ] [control-plane/cluster] Sending WebhookStatus.SUCCESS notification event WebhookEvent.UPGRADE with payload {'new_registry': 'public.ecr.aws/docker/library/crate', 'new_version': '5.9.8', 'old_registry': 'public.ecr.aws/docker/library/crate', 'old_version': '5.8.6'}
[2025-03-10 13:45:47,142] kopf.objects         [INFO    ] [control-plane/cluster] Updating is processed: 1 succeeded; 0 failed.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update' succeeded.
[2025-03-10 13:46:11,500] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'ping_cratedb' succeeded.
[2025-03-10 13:46:11,500] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.HEALTH.

We noticed that the crate-operator sets the updateStrategy of the StatefulSet to OnDelete.
If we delete the CrateDB pods manually, they are recreated with the new version.
But this is a manual step, and our question is whether this is what we are expected to do, or whether we are missing something and the whole process should be automated.
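
In the meantime, the manual workaround we are using looks roughly like this (a hedged sketch, not a recommendation; pod names, namespace, and timings are specific to our setup): delete the pods one at a time and wait for each replacement to become Ready before deleting the next, so the cluster never loses more than one node at once.

import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()
NAMESPACE = "control-plane"

def wait_until_ready(name, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            pod = core.read_namespaced_pod(name, NAMESPACE)
        except ApiException as exc:
            if exc.status == 404:  # old pod gone, replacement not created yet
                time.sleep(5)
                continue
            raise
        conditions = pod.status.conditions or []
        if any(c.type == "Ready" and c.status == "True" for c in conditions):
            return
        time.sleep(5)
    raise TimeoutError(f"pod {name} not Ready after {timeout}s")

for name in ("crate-data-hot-cluster-0",
             "crate-data-hot-cluster-1",
             "crate-data-hot-cluster-2"):
    core.delete_namespaced_pod(name, NAMESPACE)
    time.sleep(10)  # let the old pod start terminating before polling
    wait_until_ready(name)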

Before we updated the CrateDB version, we had noticed a repeating error in the crate-operator log:

[2025-03-10 12:57:27,403] kopf.objects         [ERROR   ] [control-plane/cluster] Timer 'ping_cratedb' failed with an exception. Will retry.
Traceback (most recent call last):
  File "main.py", line 349, in ping_cratedb
    hot_node: dict = next(
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 279, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 374, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
RuntimeError: coroutine raised StopIteration
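
As far as we can tell, the mechanics here are plain Python: next() on a generator that yields nothing raises StopIteration, and since PEP 479 a StopIteration escaping a coroutine is converted into exactly this RuntimeError. A minimal reproduction (the node-selection predicate is only our guess at what the operator checks, based on the fact that renaming the nodes fixed it):

import asyncio

async def ping_cratedb(nodes):
    # next() without a default raises StopIteration when nothing matches;
    # the coroutine machinery turns that into
    # "RuntimeError: coroutine raised StopIteration"
    hot_node = next(n for n in nodes if "hot" in n["name"])
    return hot_node

# None of our nodes had "hot" in the name, so the generator was empty:
asyncio.run(ping_cratedb([{"name": "data-nodes-0"}]))

A next(..., None) default with explicit handling of the no-match case would avoid the crash, which is why we are wondering below whether this is a bug or a naming requirement.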

So before performing the version update we renamed the data nodes of the CrateDB cluster to “hot”.
This resolved the error, so we proceeded with the version update.
But even with this error fixed, the pods were still not recreated, as described above.
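
For completeness, the rename itself was just an edit of the CrateDB resource, roughly like this (a sketch: the group/version and the field path are our reading of the CrateDB CRD, and a merge patch replaces the nodes list wholesale, so the complete node spec has to be supplied, not only the changed name):

from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

custom.patch_namespaced_custom_object(
    group="cloud.crate.io", version="v1", namespace="control-plane",
    plural="cratedbs", name="cluster",
    # The list entry must repeat the full node spec (replicas, resources, ...)
    body={"spec": {"nodes": {"data": [{"name": "hot", "replicas": 3}]}}},
)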

Regarding this error: is it a bug? Is naming the data nodes “hot” mandatory for the crate-operator to work properly?

We also read in the documentation about:

  1. Cluster Restart:
    https://github.com/crate/crate-operator/blob/67bf7d600c84a3bed51f748c8f3916ea39be1477/docs/source/concepts.rst
    Is this what we need to trigger for the pods to be recreated, and if so, how do we trigger it? The documentation only says “When instructed to do so”.

  2. The ROLLING_RESTART_TIMEOUT property:
    https://github.com/crate/crate-operator/blob/67bf7d600c84a3bed51f748c8f3916ea39be1477/docs/source/configuration.rst
    Is this property something we need to configure?
    And does it limit the time before the rolling restart starts, or how long the rolling restart itself may take? (A sketch of how we would set it is below.)
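
If it turns out we do need to set ROLLING_RESTART_TIMEOUT, this is how we would add it to the operator Deployment (a sketch: we are assuming the documented CRATEDB_OPERATOR_ environment-variable prefix, and 3600 is purely an example value):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Strategic merge patch: container and env entries are merged by name
patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "crate-operator",
    "env": [{"name": "CRATEDB_OPERATOR_ROLLING_RESTART_TIMEOUT",
             "value": "3600"}],
}]}}}}
apps.patch_namespaced_deployment("crate-operator", "control-plane", patch)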

Best regards,
Darin Venkov