2-node cluster: recurrent issue with requests timing out

Hi All,
I’m running a 2-node integration-testing cluster on version 5.6.5, the same version as my production cluster.
I’ve got a very basic setup on each node:

# Networking: Bind to an IP address or interface other than localhost.
# Be careful! Never expose an unprotected node to the internet.
# Choose from [IP Address], _local_, _site_, _global_ or _[networkInterface]_
#network.host: _site_
#network.host: _local_
network.host: 0.0.0.0

# Cluster discovery: Specify the hosts which will form the CrateDB cluster
discovery.seed_hosts:
  - 192.168.2.1
  - 192.168.2.12

# Bootstrap the cluster using an initial set of master-eligible nodes. All
# master-eligible nodes must be set here for new (or upgraded from CrateDB < 4.x)
# production (non loop-back bound) clusters otherwise the cluster is not able to
# initially vote an initial master node:
#
cluster.initial_master_nodes:
  - 192.168.2.1
  - 192.168.2.12

The two nodes are Linux hosts (one Fedora 40, one Debian 12) directly linked by a 1 Gbps Ethernet cable, no switch in between …

I’m constantly running into a synchronization issue between the two nodes and I cannot pinpoint the cause. Here’s a startup sequence on one of the two nodes: one can see that a master is elected, the cluster synchronizes, and the health turns green.

[2025-04-14T15:41:57,647][INFO ][o.e.e.NodeEnvironment    ] [Mont Tondu] using [1] data paths, mounts [[/home (/dev/mapper/osuperset--vg-home)]], net usable_space [663.2gb], net total_space [880gb], types [ext4]
[2025-04-14T15:41:57,650][INFO ][o.e.e.NodeEnvironment    ] [Mont Tondu] heap size [3gb], compressed ordinary object pointers [true]
[2025-04-14T15:41:57,982][INFO ][o.e.n.Node               ] [Mont Tondu] node name [Mont Tondu], node ID [rjcyAsKBQXOJo8JkfVygeg], cluster name [o-cell-test]
[2025-04-14T15:41:57,983][INFO ][o.e.n.Node               ] [Mont Tondu] version[5.6.5], pid[48244], build[5db11af/NA], OS[Linux/6.1.0-32-amd64/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.3+9-LTS]
[2025-04-14T15:41:58,715][INFO ][o.e.p.PluginsService     ] [Mont Tondu] no modules loaded
[2025-04-14T15:41:58,718][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [cr8-copy-s3]
[2025-04-14T15:41:58,718][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-functions]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-jmx-monitoring]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-lang-js]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-analysis-common]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-analysis-phonetic]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-repository-azure]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [io.crate.plugin.SrvPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [io.crate.udc.plugin.UDCPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
[2025-04-14T15:41:59,755][INFO ][o.e.d.DiscoveryModule    ] [Mont Tondu] using discovery type [zen] and seed hosts providers [settings]
[2025-04-14T15:42:01,182][INFO ][psql                     ] [Mont Tondu] PSQL SSL support is disabled.
[2025-04-14T15:42:01,539][INFO ][i.c.n.c.PipelineRegistry ] [Mont Tondu] HTTP SSL support is disabled.
[2025-04-14T15:42:01,572][WARN ][o.e.g.DanglingIndicesState] [Mont Tondu] gateway.auto_import_dangling_indices is disabled, dangling indices will not be detected or imported
[2025-04-14T15:42:01,652][INFO ][o.e.n.Node               ] [Mont Tondu] initialized
[2025-04-14T15:42:01,653][INFO ][o.e.n.Node               ] [Mont Tondu] starting ...
[2025-04-14T15:42:01,719][INFO ][psql                     ] [Mont Tondu] publish_address {192.168.2.12:5432}, bound_addresses {[::]:5432}
[2025-04-14T15:42:01,727][INFO ][o.e.h.n.Netty4HttpServerTransport] [Mont Tondu] publish_address {192.168.2.12:4200}, bound_addresses {[::]:4200}
[2025-04-14T15:42:01,735][INFO ][o.e.t.TransportService   ] [Mont Tondu] publish_address {192.168.2.12:4300}, bound_addresses {[::]:4300}
[2025-04-14T15:42:02,381][INFO ][o.e.b.BootstrapChecks    ] [Mont Tondu] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2025-04-14T15:42:02,386][INFO ][o.e.c.c.Coordinator      ] [Mont Tondu] cluster UUID [FAQR5_pWS3uiyaQaKTiVdQ]

[2025-04-14T15:42:02,568][INFO ][o.e.c.s.MasterService    ] [Mont Tondu] elected-as-master ([1] nodes joined)[{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 21, version: 151301, reason: master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}
[2025-04-14T15:42:02,887][INFO ][o.e.c.s.ClusterApplierService] [Mont Tondu] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151301, reason: Publication{term=21, version=151301}
[2025-04-14T15:42:02,899][INFO ][o.e.n.Node               ] [Mont Tondu] started
[2025-04-14T15:42:03,118][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] updating number_of_replicas to [1] for indices [ocell.user, ocell..partitioned.sensor_location.04732dhn68qj6c9i60o30c1g, ocell.data_source_type, ocell..partitioned.message.04732d9n6ss36dho60o30c1g, ocell.housing, ocell..partitioned.sensor_location.04732dhg74q3ae9i60o30c1g, ocell.synthetic_indicator_source, ocell..partitioned.indicators.04732dhn68qj6c9i60o30c1g, ocell..partitioned.message.04732d1i60o3ec1k60o30c1g, ocell..partitioned.indicators.04732cpi6kpjedhg60o30c1g, ocell.camping, ocell..partitioned.message.04732dpg6go3cdpi60o30c1g, ocell.area, ocell.events_log, ocell..partitioned.message.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732cpo70qj6d1k60o30c1g, ocell.gateway_status, ocell.gateway_type, ocell..partitioned.sensor_location.04732d9n6ss36dho60o30c1g, ocell.monthly_report, ocell.sensor, ocell.service_type, ocell..partitioned.indicators.04732dpj6kr3ge9m60o30c1g, ocell.pms_data, ocell.notification_comment, ocell.notification, ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g, ocell..partitioned.indicators.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732d1l64r30dhk60o30c1g, ocell..partitioned.sensor_location.04732d9h6grjcd1o60o30c1g, ocell..partitioned.message.04732dhn68qj6c9i60o30c1g, ocell..partitioned.indicators.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732dpg6go3cdpi60o30c1g, ocell..partitioned.indicators.04732d9k6opj0c1o60o30c1g, ocell..partitioned.message.04732dhk60sjid9i60o30c1g, ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g, ocell.user_login, ocell..partitioned.indicators.04732dhk60sjid9i60o30c1g, ocell.location, ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g, ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g, ocell.data_source, ocell..partitioned.message.04732dhg74q3ae9i60o30c1g, ocell..partitioned.message.04732cpo70qj6d1k60o30c1g, ocell.user_group, ocell.site, ocell..partitioned.message.04732d1o6cp34e1o60o30c1g, ocell.sms_log, ocell.gateway, ocell..partitioned.sensor_location.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.0472qdpl6spjgchk60o30c1g, ocell.api_token, ocell..partitioned.message.04732d1l64r30dhk60o30c1g, ocell.indicator_source, ocell.energy_conversion_factor, ocell.provider_outage, ocell..partitioned.sensor_location.04732dhk60sjid9i60o30c1g, ocell..partitioned.message.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.04732d9n6ss36dho60o30c1g, ocell.message_type, ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g, ocell.info_client]
[2025-04-14T15:42:03,120][INFO ][o.e.c.s.MasterService    ] [Mont Tondu] node-join[{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200} join existing leader], term: 21, version: 151302, reason: added {{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200},}
[2025-04-14T15:42:07,505][INFO ][o.e.c.s.ClusterApplierService] [Mont Tondu] added {{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200},}, term: 21, version: 151302, reason: Publication{term=21, version=151302}
[2025-04-14T15:42:10,204][INFO ][o.e.c.s.ClusterSettings  ] [Mont Tondu] updating [stats.service.interval] from [24h] to [0]
[2025-04-14T15:42:10,222][INFO ][o.e.g.GatewayService     ] [Mont Tondu] recovered [63] indices into cluster_state
[2025-04-14T15:42:19,834][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g][0] marking unavailable shards as stale: [t4afzN2XT4-Or0H4pddogw]
[2025-04-14T15:42:19,835][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g][0] marking unavailable shards as stale: [6iAJlz2XRBS4YEKA7u8vIw]
[2025-04-14T15:42:28,105][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell.sensor][0] marking unavailable shards as stale: [J3sUjwb3S_6bVGvrERwQtw]
[2025-04-14T15:42:36,887][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell.notification][0] marking unavailable shards as stale: [w6C5Wrf7SOCXgKwHetUNSQ]
[2025-04-14T15:42:52,027][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g][0]]]).
[2025-04-14T15:42:55,040][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13425ms] ago, timed out [3402ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [421]
[2025-04-14T15:44:54,848][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [26625ms] ago, timed out [16616ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [8310]
[2025-04-14T15:44:54,849][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [15615ms] ago, timed out [5606ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [8343]
[2025-04-14T15:46:29,733][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g][0]]]).
[2025-04-14T15:46:55,423][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13412ms] ago, timed out [3403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [16462]
[2025-04-14T16:01:42,208][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13212ms] ago, timed out [3203ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29224]
[2025-04-14T16:03:04,640][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13412ms] ago, timed out [3403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29417]
[2025-04-14T16:03:46,880][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [26825ms] ago, timed out [16815ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29441]
[2025-04-14T16:03:46,880][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [15814ms] ago, timed out [5805ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29443]
[2025-04-14T16:05:22,867][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] updating number_of_replicas to [0] for indices [ocell.user, ocell..partitioned.sensor_location.04732dhn68qj6c9i60o30c1g, ocell.data_source_type, ocell..partitioned.message.04732d9n6ss36dho60o30c1g, ocell.housing, ocell..partitioned.sensor_location.04732dhg74q3ae9i60o30c1g, ocell.synthetic_indicator_source, ocell..partitioned.indicators.04732dhn68qj6c9i60o30c1g, ocell..partitioned.message.04732d1i60o3ec1k60o30c1g, ocell..partitioned.indicators.04732cpi6kpjedhg60o30c1g, ocell.camping, ocell..partitioned.message.04732dpg6go3cdpi60o30c1g, ocell.area, ocell.events_log, ocell..partitioned.message.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732cpo70qj6d1k60o30c1g, ocell.gateway_status, ocell.gateway_type, ocell..partitioned.sensor_location.04732d9n6ss36dho60o30c1g, ocell.monthly_report, ocell.sensor, ocell.service_type, ocell..partitioned.indicators.04732dpj6kr3ge9m60o30c1g, ocell.pms_data, ocell.notification_comment, ocell.notification, ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g, ocell..partitioned.indicators.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732d1l64r30dhk60o30c1g, ocell..partitioned.sensor_location.04732d9h6grjcd1o60o30c1g, ocell..partitioned.message.04732dhn68qj6c9i60o30c1g, ocell..partitioned.indicators.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732dpg6go3cdpi60o30c1g, ocell..partitioned.indicators.04732d9k6opj0c1o60o30c1g, ocell..partitioned.message.04732dhk60sjid9i60o30c1g, ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g, ocell.user_login, ocell..partitioned.indicators.04732dhk60sjid9i60o30c1g, ocell.location, ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g, ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g, ocell.data_source, ocell..partitioned.message.04732dhg74q3ae9i60o30c1g, ocell..partitioned.message.04732cpo70qj6d1k60o30c1g, ocell.user_group, ocell.site, ocell..partitioned.message.04732d1o6cp34e1o60o30c1g, ocell.sms_log, ocell.gateway, ocell..partitioned.sensor_location.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.0472qdpl6spjgchk60o30c1g, ocell.api_token, ocell..partitioned.message.04732d1l64r30dhk60o30c1g, ocell.indicator_source, ocell.energy_conversion_factor, ocell.provider_outage, ocell..partitioned.sensor_location.04732dhk60sjid9i60o30c1g, ocell..partitioned.message.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.04732d9n6ss36dho60o30c1g, ocell.message_type, ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g, ocell.info_client]
[2025-04-14T16:05:22,872][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [YELLOW] to [GREEN] (reason: [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200} followers check retry count exceeded]).
[2025-04-14T16:05:43,968][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.message.04732dhg74q3ae9i60o30c1g][0] marking unavailable shards as stale: [FONfliKKQAaPDQ5CQvY1iA]

But after only a few minutes, I start seeing these messages on the other node:

[2025-04-14T16:05:37,258][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], current []}, term: 21, version: 151439, reason: becoming candidate: onLeaderFailure
[2025-04-14T16:05:39,407][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151444, reason: ApplyCommitRequest{term=21, version=151444, sourceNode={Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}}
[2025-04-14T16:05:53,935][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [26618ms] ago, timed out [16611ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3506262]
[2025-04-14T16:05:53,935][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [16211ms] ago, timed out [6204ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3506430]
[2025-04-14T16:12:03,861][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [3402ms] ago, timed out [400ms] ago, action [internal:crate:sql/sys/nodes], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3507236]

[2025-04-14T16:23:31,040][INFO ][o.e.c.c.Coordinator      ] [Männliflue] master node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:267) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:520) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutResponseHandler.handleException(TransportService.java:1020) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:928) ~[crate-server-5.6.5.jar:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [Mont Tondu][192.168.2.12:4300][internal:coordination/fault_detection/leader_check] request_id [3508075] timed out after [10007ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:929) ~[crate-server-5.6.5.jar:?]
        ... 3 more
[2025-04-14T16:23:31,041][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], current []}, term: 21, version: 151579, reason: becoming candidate: onLeaderFailure
[2025-04-14T16:23:32,098][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151580, reason: ApplyCommitRequest{term=21, version=151580, sourceNode={Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}}
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [52835ms] ago, timed out [42828ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508073]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [41828ms] ago, timed out [31822ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508074]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [30821ms] ago, timed out [20814ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508075]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [20614ms] ago, timed out [10607ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508082]
[2025-04-14T16:24:22,201][INFO ][o.e.c.c.Coordinator      ] [Männliflue] master node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:267) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:520) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutResponseHandler.handleException(TransportService.java:1020) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:287) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutor.execute(EsExecutors.java:160) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:282) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:274) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:131) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:96) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:675) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:143) ~[crate-server-5.6.5.jar:?] 
        at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:118) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:83) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71) ~[crate-server-5.6.5.jar:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[netty-handler-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

It looks as if the network were dropping packets … but as I’ve stated, the two machines are directly linked by an Ethernet cable, and on both hosts I see no warnings or errors in dmesg about networking issues (hardware or software) …

Also, when I check the web interface everything appears to be OK (the cluster is green), but some of my services get stuck while inserting data … I need to restart one node for the cluster to get back to normal, until it fails again after a while.
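For what it’s worth, the values in the warnings above (the [10007ms] timeout and the “failed [3] consecutive checks”) look like the default fault-detection behaviour (10 s timeout, 3 retries). If CrateDB exposes the same Elasticsearch-derived settings (I haven’t verified that these exact setting names apply to CrateDB), I suppose I could relax them in crate.yml as a workaround, roughly like this:

# Assumed setting names (copied from Elasticsearch); values are only an example
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 5
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 5

But that would only hide the symptom, so I’d rather understand why the checks are timing out in the first place.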

Does anyone have a clue about what could be causing this?
Thanks

Hi Charles, nothing in particular comes to mind, but since this is happening in your test environment and version 5.6.5 is a bit dated now, my suggestion would be to check whether the issue persists after upgrading to one of the latest versions. Would that be possible for you?

Hi Hernan,
thanks for the reply. I tried upgrading to 5.9.10 a few weeks ago, but I suspected the new version was using more memory than 5.6.x, so I had to revert to 5.6.x because I did not have the time to do precise testing and roll that version out to production as well (I have to run the same version in both the test and production environments).
Do you have any information about a change in memory footprint between 5.6.x and 5.9.x?
Thanks.

There was nothing that would increase memory usage; quite the opposite, there were plenty of optimizations that should bring better performance. Could you try 5.10.4 instead of 5.9.10?

I could try … I thought I had read something about a JOIN performance optimization that could cause a performance decrease on tables created with earlier versions of CrateDB, and that would require re-creating those tables for it to work properly, but it seems I cannot find it anymore. Do you have any information about this?