HA was not achieved

Hi Team,

We have a 4-node cluster, and the table has 0-1 replicas configured.
But when one node goes down, we get the error below and HA is not achieved:

The value of the cluster setting 'gateway.expected_data_nodes' (or the deprecated gateway.recovery_after_nodes setting) must be equal to the maximum/expected number of (data) nodes in the cluster

What would be the best practice for setting the parameters below?

  1. gateway.expected_data_nodes
  2. gateway.recovery_after_nodes

Thanks
Vinayak Katkar

This configuration is primarily important for conducting a full cluster restart, rather than for regular operations. It is a warning, not an error.

HA was not achieved

How do you determine this?
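
One way to verify this on the remaining nodes is to check per-table health in the sys.health system table, e.g. with a query along these lines (a sketch, assuming a recent CrateDB version):

```sql
-- Per-table health after a node leaves the cluster:
-- 'RED'    = primary shards missing, data unavailable
-- 'YELLOW' = replicas under-replicated, data still served
SELECT table_name,
       health,
       missing_shards,
       underreplicated_shards
FROM sys.health
ORDER BY severity DESC;
```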

What would be the best practice for setting the parameters below?

It’s advisable to match it with the number of nodes in your cluster that store data. This recommendation varies based on your specific configuration, such as whether you’re employing zoning or other setup details.
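
As a rough sketch, for a 4-node cluster in which every node holds data, the corresponding entries in crate.yml might look like the following. The values are examples for this particular setup, and the setting names should be double-checked against the documentation for your CrateDB version; gateway settings are node settings and require a restart to change:

```yaml
# crate.yml (every node) – example values for a 4-node cluster.
# These settings only control when shard recovery starts after a
# full cluster restart; they do not affect regular failover.
gateway.expected_data_nodes: 4       # all 4 nodes hold data
gateway.recover_after_data_nodes: 3  # start recovery even if one node stays down
```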

[2024-04-03T15:27:21,667][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][215] overhead, spent [669ms] collecting in the last [1s]
[2024-04-03T15:27:23,676][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][217] overhead, spent [635ms] collecting in the last [1s]
[2024-04-03T15:29:05,004][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][318] overhead, spent [467ms] collecting in the last [1s]
[2024-04-03T15:29:30,011][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][343] overhead, spent [254ms] collecting in the last [1s]
[2024-04-03T15:37:58,335][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][851] overhead, spent [324ms] collecting in the last [1s]
[2024-04-03T15:38:36,795][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][887][47] duration [3.2s], collections [1]/[3.4s], total [3.2s]/[10s], memory [19.5gb]->[5.4gb]/[28gb], all_pools {[young] [16.6gb]->[0b]/[0b]}{[old] [2.7gb]->[3.3gb]/[28gb]}{[survivor] [48mb]->[2.1gb]/[0b]}
[2024-04-03T15:38:36,796][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][887] overhead, spent [3.2s] collecting in the last [3.4s]
[2024-04-03T15:38:40,210][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][888][48] duration [3.3s], collections [1]/[3.4s], total [3.3s]/[13.4s], memory [5.4gb]->[5.5gb]/[28gb], all_pools {[young] [0b]->[16mb]/[0b]}{[old] [3.3gb]->[5.4gb]/[28gb]}{[survivor] [2.1gb]->[32mb]/[0b]}
[2024-04-03T15:38:40,210][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][888] overhead, spent [3.3s] collecting in the last [3.4s]
[2024-04-03T15:39:20,437][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][926][53] duration [2.4s], collections [1]/[3.2s], total [2.4s]/[16s], memory [10.3gb]->[7.8gb]/[28gb], all_pools {[young] [4.4gb]->[0b]/[0b]}{[old] [5.6gb]->[7.2gb]/[28gb]}{[survivor] [160mb]->[640mb]/[0b]}
[2024-04-03T15:39:20,438][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][926] overhead, spent [2.4s] collecting in the last [3.2s]
[2024-04-03T15:39:29,422][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][young][928][54] duration [7.2s], collections [1]/[7.9s], total [7.2s]/[23.2s], memory [8gb]->[8gb]/[28gb], all_pools {[young] [192mb]->[0b]/[0b]}{[old] [7.2gb]->[7.8gb]/[28gb]}{[survivor] [640mb]->[184mb]/[0b]}
[2024-04-03T15:39:29,423][WARN ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][928] overhead, spent [7.2s] collecting in the last [7.9s]
[2024-04-03T15:39:32,424][INFO ][o.e.m.j.JvmGcMonitorService] [Node3] [gc][931] overhead, spent [250ms] collecting in the last [1s]

The above errors/warnings keep appearing on the cluster.
One node keeps going down continuously. Should I upgrade the RAM, or is the issue something different?

Thanks
Vinayak Katkar

The logs indicate significant garbage collection overhead on the CrateDB node, which could impact performance. This suggests the node’s issues might stem from multiple factors, including insufficient RAM. However, it’s also crucial to consider JVM settings and the specifics of the workloads on the node. Understanding these aspects better could help address the performance issues without necessarily upgrading the RAM.
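
To correlate these GC warnings with actual heap pressure, you could look at per-node heap usage in sys.nodes, roughly like this (a sketch; the raw values are in bytes):

```sql
-- Heap usage per node; values persistently close to the maximum
-- typically go hand in hand with long GC pauses like the ones above.
SELECT name,
       heap['used'] / 1024.0 / 1024.0 / 1024.0 AS heap_used_gb,
       heap['max']  / 1024.0 / 1024.0 / 1024.0 AS heap_max_gb
FROM sys.nodes
ORDER BY heap_used_gb DESC;
```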

Hi,
We have a 4-node cluster setup, and we have given 26 GB out of 30 GB of total memory.
Every node has the same specification. Suddenly, since 1st April, the above errors appear and a node is continuously evicted. We tried our best but could not find any root cause. If possible, please help us with this, as it affects production.

Thanks
Vinayak Katkar

Without understanding your complete setup and how you are interacting with the cluster, this is rather difficult to assess, and it probably requires someone from the professional services / support team. Also, the logs you posted so far are just warnings and don't point to a concrete issue.

What do you mean by 26 out of 30 GB? 26 GB heap out of total system memory? If that is the case, this seems to be a lot. The default recommendation is 25% of total system memory for the heap, to allow CrateDB to use memory mapping properly.

Thanks for your response.
Yes, we have given 26 GB to the heap out of 30 GB of system memory.

If we want to solve this performance-related issue, whom do we need to contact?

That is rather a lot. With CrateDB 4.2 and newer, we typically recommend using 25% of system memory for the heap (i.e. 7.5 GiB in your case) and only going above 50% in special cases.
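
If you lower the heap, that is usually done via the CRATE_HEAP_SIZE environment variable rather than in crate.yml. As a sketch for a 30 GB node on a package-based install (the file path and exact value are assumptions and may differ on your system):

```sh
# /etc/default/crate (Debian/Ubuntu package install) – example value:
# roughly 25% of the 30 GB system memory; a node restart is required.
CRATE_HEAP_SIZE=7g
```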

How much data do you have stored in CrateDB? What volumes are you typically writing per day?
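
For a quick overview of the stored volume, a query against sys.shards along these lines should do (a sketch; the size column is in bytes):

```sql
-- Document count and stored size per table, largest tables first.
SELECT schema_name,
       table_name,
       sum(num_docs) AS total_docs,
       sum(size) / 1024.0 / 1024.0 / 1024.0 AS size_gb
FROM sys.shards
GROUP BY schema_name, table_name
ORDER BY size_gb DESC;
```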

Please use the contact form:

Ok, thanks.
We will contact Crate Support.