HA was not achieved

Hi Team,

We have a 4-node cluster, and the table has 0-1 replicas configured.
But when one node goes down, we get the error below and HA is not achieved:

The value of the cluster setting 'gateway.expected_data_nodes' (or the deprecated gateway.recovery_after_nodes setting) must be equal to the maximum/expected number of (data) nodes in the cluster

What would be the best practice for setting the parameters below?

  1. gateway.expected_data_nodes
  2. gateway.recovery_after_nodes

Thanks
Vinayak Katkar

This configuration is primarily important for conducting a full cluster restart, rather than for regular operations. It is a warning, not an error.

HA was not achieved

How do you determine this?
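
One way to verify this on the remaining nodes is to check per-table health in the sys.health system table, e.g. with a query along these lines (a sketch, assuming a recent CrateDB version):

```sql
-- Per-table health after a node leaves the cluster:
-- 'RED'    = primary shards missing, data unavailable
-- 'YELLOW' = replicas under-replicated, data still served
SELECT table_name,
       health,
       missing_shards,
       underreplicated_shards
FROM sys.health
ORDER BY severity DESC;
```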

What would be the best practice for setting the parameters below?

It’s advisable to match it with the number of nodes in your cluster that store data. This recommendation varies based on your specific configuration, such as whether you’re employing zoning or other setup details.
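
As a rough sketch, for a 4-node cluster in which every node holds data, the corresponding entries in crate.yml might look like the following. The values are examples for this particular setup, and the setting names should be double-checked against the documentation for your CrateDB version; gateway settings are node settings and require a restart to change:

```yaml
# crate.yml (every node) – example values for a 4-node cluster.
# These settings only control when shard recovery starts after a
# full cluster restart; they do not affect regular failover.
gateway.expected_data_nodes: 4       # all 4 nodes hold data
gateway.recover_after_data_nodes: 3  # start recovery even if one node stays down
```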

[2024-04-03T15:27:21,667][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][215] overhead, spent [669ms] collecting in the last [1s]
[2024-04-03T15:27:23,676][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][217] overhead, spent [635ms] collecting in the last [1s]
[2024-04-03T15:29:05,004][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][318] overhead, spent [467ms] collecting in the last [1s]
[2024-04-03T15:29:30,011][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][343] overhead, spent [254ms] collecting in the last [1s]
[2024-04-03T15:37:58,335][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][851] overhead, spent [324ms] collecting in the last [1s]
[2024-04-03T15:38:36,795][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][887][47] duration [3.2s], collections [1]/[3.4s], total [3.2s]/[10s], memory [19.5gb]->[5.4gb]/[28gb], all_pools {[young] [16.6gb]->[0b]/[0b]}{[old] [2.7gb]->[3.3gb]/[28gb]}{[survivor] [48mb]->[2.1gb]/[0b]}
[2024-04-03T15:38:36,796][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][887] overhead, spent [3.2s] collecting in the last [3.4s]
[2024-04-03T15:38:40,210][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][888][48] duration [3.3s], collections [1]/[3.4s], total [3.3s]/[13.4s], memory [5.4gb]->[5.5gb]/[28gb], all_pools {[young] [0b]->[16mb]/[0b]}{[old] [3.3gb]->[5.4gb]/[28gb]}{[survivor] [2.1gb]->[32mb]/[0b]}
[2024-04-03T15:38:40,210][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][888] overhead, spent [3.3s] collecting in the last [3.4s]
[2024-04-03T15:39:20,437][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][926][53] duration [2.4s], collections [1]/[3.2s], total [2.4s]/[16s], memory [10.3gb]->[7.8gb]/[28gb], all_pools {[young] [4.4gb]->[0b]/[0b]}{[old] [5.6gb]->[7.2gb]/[28gb]}{[survivor] [160mb]->[640mb]/[0b]}
[2024-04-03T15:39:20,438][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][926] overhead, spent [2.4s] collecting in the last [3.2s]
[2024-04-03T15:39:29,422][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][928][54] duration [7.2s], collections [1]/[7.9s], total [7.2s]/[23.2s], memory [8gb]->[8gb]/[28gb], all_pools {[young] [192mb]->[0b]/[0b]}{[old] [7.2gb]->[7.8gb]/[28gb]}{[survivor] [640mb]->[184mb]/[0b]}
[2024-04-03T15:39:29,423][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][928] overhead, spent [7.2s] collecting in the last [7.9s]
[2024-04-03T15:39:32,424][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][931] overhead, spent [250ms] collecting in the last [1s]

The above errors/warnings keep appearing on the cluster.
One node keeps going down continuously. Should I upgrade the RAM, or is the issue something different?

Thanks
Vinayak Katkar

The logs indicate significant garbage collection overhead on the CrateDB node, which could impact performance. This suggests the node’s issues might stem from multiple factors, including insufficient RAM. However, it’s also crucial to consider JVM settings and the specifics of the workloads on the node. Understanding these aspects better could help address the performance issues without necessarily upgrading the RAM.
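
To correlate these GC warnings with actual heap pressure, you could look at per-node heap usage in sys.nodes, roughly like this (a sketch; the raw values are in bytes):

```sql
-- Heap usage per node; values persistently close to the maximum
-- typically go hand in hand with long GC pauses like the ones above.
SELECT name,
       heap['used'] / 1024.0 / 1024.0 / 1024.0 AS heap_used_gb,
       heap['max']  / 1024.0 / 1024.0 / 1024.0 AS heap_max_gb
FROM sys.nodes
ORDER BY heap_used_gb DESC;
```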

Hi,
We have a 4-node cluster setup, and we have given 26 GB out of 30 GB of total memory.
Every node has the same specification. Suddenly, since 1st April, the above errors appear and a node is continuously evicted. We tried our best but could not find any root cause. If possible, please help us with this, as it affects production.

Thanks
Vinayak Katkar

Without understanding your complete setup and how you are interacting with the cluster, this is rather difficult to assess, and it probably requires someone from the professional services / support team. Also, the logs you posted so far are just warnings and don't point to a concrete issue.

What do you mean by 26 out of 30 GB? 26 GB heap out of total system memory? If that is the case, this seems to be a lot. The default recommendation is 25% of total system memory for the heap, to allow CrateDB to use memory mapping properly.

Thanks for your response.
Yes, we have given 26 GB to the heap out of 30 GB of system memory.

If we want to solve this performance-related issue, whom do we need to contact?

That is rather a lot. With CrateDB 4.2 and newer, we typically recommend using 25% of system memory for the heap (i.e. 7.5 GiB in your case) and only going above 50% in special cases.
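
If you lower the heap, that is usually done via the CRATE_HEAP_SIZE environment variable rather than in crate.yml. As a sketch for a 30 GB node on a package-based install (the file path and exact value are assumptions and may differ on your system):

```sh
# /etc/default/crate (Debian/Ubuntu package install) – example value:
# roughly 25% of the 30 GB system memory; a node restart is required.
CRATE_HEAP_SIZE=7g
```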

How much data do you have stored in CrateDB? What volumes are you typically writing per day?
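
For a quick overview of the stored volume, a query against sys.shards along these lines should do (a sketch; the size column is in bytes):

```sql
-- Document count and stored size per table, largest tables first.
SELECT schema_name,
       table_name,
       sum(num_docs) AS total_docs,
       sum(size) / 1024.0 / 1024.0 / 1024.0 AS size_gb
FROM sys.shards
GROUP BY schema_name, table_name
ORDER BY size_gb DESC;
```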

Please use the contact form:

Ok, thanks.
We will contact Crate Support.