CrateDB version update in Kubernetes not fully automated?

Hi all,

We have a Kubernetes cluster running the crate-operator (version 2.45.0) and a CrateDB cluster with three nodes created from a CrateDB resource.

As a result, the cluster contains one pod for the crate-operator (created from the ReplicaSet of a Deployment) and three pods for the CrateDB cluster (created from the StatefulSet that the operator manages for the CrateDB resource).

The issue we are having is that when we update the version of the CrateDB cluster in the CrateDB resource, the crate-operator picks up the change and applies it to the StatefulSet, but the three CrateDB pods keep running the previous version.
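
For reference, this is roughly how we confirmed the mismatch, using the official kubernetes Python client (a minimal sketch; the resource names and the label selector are placeholders for our setup):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

# Image the crate-operator wrote into the StatefulSet (the new version)
sts = apps.read_namespaced_stateful_set("crate-data-hot-cluster", "control-plane")
print("StatefulSet image:", sts.spec.template.spec.containers[0].image)

# Images the pods are actually running (still the old version)
for pod in core.list_namespaced_pod("control-plane", label_selector="app=crate").items:
    print(pod.metadata.name, pod.spec.containers[0].image)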

According to the crate-operator log, the update finished successfully:

[2025-03-10 13:45:11,380] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.HEALTH.
[2025-03-10 13:45:11,380] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'ping_cratedb' succeeded.
[2025-03-10 13:45:15,956] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/disable_cronjob' succeeded.
[2025-03-10 13:45:16,076] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'enable_cronjob_after_delay/status.delay_cronjob' succeeded.
[2025-03-10 13:45:16,090] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/upgrade' failed temporarily: Waiting for 'cluster_update/before_cluster_update'.
[2025-03-10 13:45:16,227] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/restart' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade'.
[2025-03-10 13:45:16,367] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/restore_user_jwt_auth' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart'.
[2025-03-10 13:45:16,506] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/after_upgrade' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth'.
[2025-03-10 13:45:16,653] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth,cluster_update/after_upgrade'.
[2025-03-10 13:45:16,788] kopf.objects         [ERROR   ] [control-plane/cluster] Handler 'cluster_update/notify_success_update' failed temporarily: Waiting for 'cluster_update/before_cluster_update,cluster_update/upgrade,cluster_update/restart,cluster_update/restore_user_jwt_auth,cluster_update/after_upgrade,cluster_update/after_cluster_update'.
[2025-03-10 13:45:17,020] kopf.objects         [INFO    ] [control-plane/cluster] Checking if there are running snapshots ...
[2025-03-10 13:45:17,078] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/ensure_no_snapshots_in_progress' succeeded.
[2025-03-10 13:45:17,227] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/ensure_no_cronjobs_running' succeeded.
[2025-03-10 13:45:17,497] kopf.objects         [INFO    ] [control-plane/cluster] Trying to set setting cluster.routing.allocation.enable to value new_primaries with mode PERSISTENT
[2025-03-10 13:45:17,546] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update/set_cluster_routing_allocation_setting' succeeded.
[2025-03-10 13:45:17,547] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/before_cluster_update' succeeded.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/upgrade' succeeded.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.UPGRADE.
[2025-03-10 13:45:46,290] kopf.objects         [INFO    ] [control-plane/cluster] Sending WebhookStatus.IN_PROGRESS notification event WebhookEvent.UPGRADE with payload {'old_registry': 'public.ecr.aws/docker/library/crate', 'new_registry': 'public.ecr.aws/docker/library/crate', 'old_version': '5.8.6', 'new_version': '5.9.8'}
[2025-03-10 13:45:46,459] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/restart' succeeded.
[2025-03-10 13:45:46,611] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/restore_user_jwt_auth' succeeded.
[2025-03-10 13:45:46,753] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_upgrade' succeeded.
[2025-03-10 13:45:46,954] kopf.objects         [INFO    ] [control-plane/cluster] Trying to reset setting cluster.routing.allocation.enable
[2025-03-10 13:45:46,997] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update' succeeded.
[2025-03-10 13:45:46,997] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/after_cluster_update/reset_cluster_routing_allocation_setting' succeeded.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update/notify_success_update' succeeded.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.UPGRADE.
[2025-03-10 13:45:47,140] kopf.objects         [INFO    ] [control-plane/cluster] Sending WebhookStatus.SUCCESS notification event WebhookEvent.UPGRADE with payload {'new_registry': 'public.ecr.aws/docker/library/crate', 'new_version': '5.9.8', 'old_registry': 'public.ecr.aws/docker/library/crate', 'old_version': '5.8.6'}
[2025-03-10 13:45:47,142] kopf.objects         [INFO    ] [control-plane/cluster] Updating is processed: 1 succeeded; 0 failed.
[2025-03-10 13:45:47,141] kopf.objects         [INFO    ] [control-plane/cluster] Handler 'cluster_update' succeeded.
[2025-03-10 13:46:11,500] kopf.objects         [INFO    ] [control-plane/cluster] Timer 'ping_cratedb' succeeded.
[2025-03-10 13:46:11,500] kopf.objects         [INFO    ] [control-plane/cluster] Webhooks not configured. Not processing event WebhookEvent.HEALTH.

We noticed that the crate-operator sets the updateStrategy of the StatefulSet to OnDelete.
If we delete the CrateDB pods manually, they are recreated with the new version.
But this is a manual step, and our question is whether this is what we are expected to do, or whether we are missing something and the whole process should be automated.
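
In the meantime, the manual workaround we are using looks roughly like this (a hedged sketch, not a recommendation; pod names, namespace, and timings are specific to our setup): delete the pods one at a time and wait for each replacement to become Ready before deleting the next, so the cluster never loses more than one node at once.

import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()
NAMESPACE = "control-plane"

def wait_until_ready(name, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            pod = core.read_namespaced_pod(name, NAMESPACE)
        except ApiException as exc:
            if exc.status == 404:  # old pod gone, replacement not created yet
                time.sleep(5)
                continue
            raise
        conditions = pod.status.conditions or []
        if any(c.type == "Ready" and c.status == "True" for c in conditions):
            return
        time.sleep(5)
    raise TimeoutError(f"pod {name} not Ready after {timeout}s")

for name in ("crate-data-hot-cluster-0",
             "crate-data-hot-cluster-1",
             "crate-data-hot-cluster-2"):
    core.delete_namespaced_pod(name, NAMESPACE)
    time.sleep(10)  # let the old pod start terminating before polling
    wait_until_ready(name)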

Before we updated the CrateDB version, we had noticed a repeating error in the crate-operator log:

[2025-03-10 12:57:27,403] kopf.objects         [ERROR   ] [control-plane/cluster] Timer 'ping_cratedb' failed with an exception. Will retry.
Traceback (most recent call last):
  File "main.py", line 349, in ping_cratedb
    hot_node: dict = next(
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 279, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/execution.py", line 374, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.8/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
RuntimeError: coroutine raised StopIteration
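
As far as we can tell, the mechanics here are plain Python: next() on a generator that yields nothing raises StopIteration, and since PEP 479 a StopIteration escaping a coroutine is converted into exactly this RuntimeError. A minimal reproduction (the node-selection predicate is only our guess at what the operator checks, based on the fact that renaming the nodes fixed it):

import asyncio

async def ping_cratedb(nodes):
    # next() without a default raises StopIteration when nothing matches;
    # the coroutine machinery turns that into
    # "RuntimeError: coroutine raised StopIteration"
    hot_node = next(n for n in nodes if "hot" in n["name"])
    return hot_node

# None of our nodes had "hot" in the name, so the generator was empty:
asyncio.run(ping_cratedb([{"name": "data-nodes-0"}]))

A next(..., None) default with explicit handling of the no-match case would avoid the crash, which is why we are wondering below whether this is a bug or a naming requirement.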

So before performing the version update we renamed the data nodes of the CrateDB cluster to “hot”.
This resolved the error, so we proceeded with the version update.
But even with this error fixed, the pods were still not recreated, as described above.
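
For completeness, the rename itself was just an edit of the CrateDB resource, roughly like this (a sketch: the group/version and the field path are our reading of the CrateDB CRD, and a merge patch replaces the nodes list wholesale, so the complete node spec has to be supplied, not only the changed name):

from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

custom.patch_namespaced_custom_object(
    group="cloud.crate.io", version="v1", namespace="control-plane",
    plural="cratedbs", name="cluster",
    # The list entry must repeat the full node spec (replicas, resources, ...)
    body={"spec": {"nodes": {"data": [{"name": "hot", "replicas": 3}]}}},
)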

Regarding this error: is it a bug? Is naming the data nodes “hot” mandatory for the crate-operator to work properly?

We also read in the documentation about:

  1. Cluster Restart:
    https://github.com/crate/crate-operator/blob/67bf7d600c84a3bed51f748c8f3916ea39be1477/docs/source/concepts.rst
    Is this what we need to trigger for the pods to be recreated, and if so, how do we trigger it? The documentation only says “When instructed to do so”.

  2. The ROLLING_RESTART_TIMEOUT property:
    https://github.com/crate/crate-operator/blob/67bf7d600c84a3bed51f748c8f3916ea39be1477/docs/source/configuration.rst
    Is this property something we need to configure?
    And does it limit the time before the rolling restart starts, or how long the rolling restart itself may take? (A sketch of how we would set it is below.)
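
If it turns out we do need to set ROLLING_RESTART_TIMEOUT, this is how we would add it to the operator Deployment (a sketch: we are assuming the documented CRATEDB_OPERATOR_ environment-variable prefix, and 3600 is purely an example value):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Strategic merge patch: container and env entries are merged by name
patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "crate-operator",
    "env": [{"name": "CRATEDB_OPERATOR_ROLLING_RESTART_TIMEOUT",
             "value": "3600"}],
}]}}}}
apps.patch_namespaced_deployment("crate-operator", "control-plane", patch)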

Best regards,
Darin Venkov