2-node cluster: recurrent issue with requests timing out

Hi All,
I’m running a 2-node integration-testing cluster on version 5.6.5, the same version as my production cluster.
I’ve got a very basic setup on each node:

# Networking: Bind to an IP address or interface other than localhost.
# Be careful! Never expose an unprotected node to the internet.
# Choose from [IP Address], _local_, _site_, _global_ or _[networkInterface]_
#network.host: _site_
#network.host: _local_
network.host: 0.0.0.0

# Cluster discovery: Specify the hosts which will form the CrateDB cluster
discovery.seed_hosts:
  - 192.168.2.1
  - 192.168.2.12

# Bootstrap the cluster using an initial set of master-eligible nodes. All
# master-eligible nodes must be set here for new (or upgraded from CrateDB < 4.x)
# production (non loop-back bound) clusters otherwise the cluster is not able to
# initially vote an initial master node:
#
cluster.initial_master_nodes:
  - 192.168.2.1
  - 192.168.2.12

The two nodes are Linux hosts (one Fedora 40, one Debian 12) directly linked by a 1 Gbps Ethernet cable, no switch in between …

I’m constantly running into a synchronization issue between the two nodes and I cannot pinpoint the cause. Here’s a startup sequence on one of the two nodes: one can see that a master is elected, the cluster synchronizes, and the health turns green.

[2025-04-14T15:41:57,647][INFO ][o.e.e.NodeEnvironment    ] [Mont Tondu] using [1] data paths, mounts [[/home (/dev/mapper/osuperset--vg-home)]], net usable_space [663.2gb], net total_space [880gb], types [ext4]
[2025-04-14T15:41:57,650][INFO ][o.e.e.NodeEnvironment    ] [Mont Tondu] heap size [3gb], compressed ordinary object pointers [true]
[2025-04-14T15:41:57,982][INFO ][o.e.n.Node               ] [Mont Tondu] node name [Mont Tondu], node ID [rjcyAsKBQXOJo8JkfVygeg], cluster name [o-cell-test]
[2025-04-14T15:41:57,983][INFO ][o.e.n.Node               ] [Mont Tondu] version[5.6.5], pid[48244], build[5db11af/NA], OS[Linux/6.1.0-32-amd64/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.3+9-LTS]
[2025-04-14T15:41:58,715][INFO ][o.e.p.PluginsService     ] [Mont Tondu] no modules loaded
[2025-04-14T15:41:58,718][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [cr8-copy-s3]
[2025-04-14T15:41:58,718][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-functions]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-jmx-monitoring]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [crate-lang-js]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-analysis-common]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-analysis-phonetic]
[2025-04-14T15:41:58,719][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [es-repository-azure]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [io.crate.plugin.SrvPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [io.crate.udc.plugin.UDCPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.plugin.repository.url.URLRepositoryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.repositories.s3.S3RepositoryPlugin]
[2025-04-14T15:41:58,720][INFO ][o.e.p.PluginsService     ] [Mont Tondu] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
[2025-04-14T15:41:59,755][INFO ][o.e.d.DiscoveryModule    ] [Mont Tondu] using discovery type [zen] and seed hosts providers [settings]
[2025-04-14T15:42:01,182][INFO ][psql                     ] [Mont Tondu] PSQL SSL support is disabled.
[2025-04-14T15:42:01,539][INFO ][i.c.n.c.PipelineRegistry ] [Mont Tondu] HTTP SSL support is disabled.
[2025-04-14T15:42:01,572][WARN ][o.e.g.DanglingIndicesState] [Mont Tondu] gateway.auto_import_dangling_indices is disabled, dangling indices will not be detected or imported
[2025-04-14T15:42:01,652][INFO ][o.e.n.Node               ] [Mont Tondu] initialized
[2025-04-14T15:42:01,653][INFO ][o.e.n.Node               ] [Mont Tondu] starting ...
[2025-04-14T15:42:01,719][INFO ][psql                     ] [Mont Tondu] publish_address {192.168.2.12:5432}, bound_addresses {[::]:5432}
[2025-04-14T15:42:01,727][INFO ][o.e.h.n.Netty4HttpServerTransport] [Mont Tondu] publish_address {192.168.2.12:4200}, bound_addresses {[::]:4200}
[2025-04-14T15:42:01,735][INFO ][o.e.t.TransportService   ] [Mont Tondu] publish_address {192.168.2.12:4300}, bound_addresses {[::]:4300}
[2025-04-14T15:42:02,381][INFO ][o.e.b.BootstrapChecks    ] [Mont Tondu] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2025-04-14T15:42:02,386][INFO ][o.e.c.c.Coordinator      ] [Mont Tondu] cluster UUID [FAQR5_pWS3uiyaQaKTiVdQ]

[2025-04-14T15:42:02,568][INFO ][o.e.c.s.MasterService    ] [Mont Tondu] elected-as-master ([1] nodes joined)[{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 21, version: 151301, reason: master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}
[2025-04-14T15:42:02,887][INFO ][o.e.c.s.ClusterApplierService] [Mont Tondu] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151301, reason: Publication{term=21, version=151301}
[2025-04-14T15:42:02,899][INFO ][o.e.n.Node               ] [Mont Tondu] started
[2025-04-14T15:42:03,118][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] updating number_of_replicas to [1] for indices [ocell.user, ocell..partitioned.sensor_location.04732dhn68qj6c9i60o30c1g, ocell.data_source_type, ocell..partitioned.message.04732d9n6ss36dho60o30c1g, ocell.housing, ocell..partitioned.sensor_location.04732dhg74q3ae9i60o30c1g, ocell.synthetic_indicator_source, ocell..partitioned.indicators.04732dhn68qj6c9i60o30c1g, ocell..partitioned.message.04732d1i60o3ec1k60o30c1g, ocell..partitioned.indicators.04732cpi6kpjedhg60o30c1g, ocell.camping, ocell..partitioned.message.04732dpg6go3cdpi60o30c1g, ocell.area, ocell.events_log, ocell..partitioned.message.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732cpo70qj6d1k60o30c1g, ocell.gateway_status, ocell.gateway_type, ocell..partitioned.sensor_location.04732d9n6ss36dho60o30c1g, ocell.monthly_report, ocell.sensor, ocell.service_type, ocell..partitioned.indicators.04732dpj6kr3ge9m60o30c1g, ocell.pms_data, ocell.notification_comment, ocell.notification, ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g, ocell..partitioned.indicators.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732d1l64r30dhk60o30c1g, ocell..partitioned.sensor_location.04732d9h6grjcd1o60o30c1g, ocell..partitioned.message.04732dhn68qj6c9i60o30c1g, ocell..partitioned.indicators.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732dpg6go3cdpi60o30c1g, ocell..partitioned.indicators.04732d9k6opj0c1o60o30c1g, ocell..partitioned.message.04732dhk60sjid9i60o30c1g, ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g, ocell.user_login, ocell..partitioned.indicators.04732dhk60sjid9i60o30c1g, ocell.location, ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g, ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g, ocell.data_source, ocell..partitioned.message.04732dhg74q3ae9i60o30c1g, ocell..partitioned.message.04732cpo70qj6d1k60o30c1g, ocell.user_group, ocell.site, ocell..partitioned.message.04732d1o6cp34e1o60o30c1g, ocell.sms_log, ocell.gateway, ocell..partitioned.sensor_location.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.0472qdpl6spjgchk60o30c1g, ocell.api_token, ocell..partitioned.message.04732d1l64r30dhk60o30c1g, ocell.indicator_source, ocell.energy_conversion_factor, ocell.provider_outage, ocell..partitioned.sensor_location.04732dhk60sjid9i60o30c1g, ocell..partitioned.message.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.04732d9n6ss36dho60o30c1g, ocell.message_type, ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g, ocell.info_client]
[2025-04-14T15:42:03,120][INFO ][o.e.c.s.MasterService    ] [Mont Tondu] node-join[{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200} join existing leader], term: 21, version: 151302, reason: added {{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200},}
[2025-04-14T15:42:07,505][INFO ][o.e.c.s.ClusterApplierService] [Mont Tondu] added {{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200},}, term: 21, version: 151302, reason: Publication{term=21, version=151302}
[2025-04-14T15:42:10,204][INFO ][o.e.c.s.ClusterSettings  ] [Mont Tondu] updating [stats.service.interval] from [24h] to [0]
[2025-04-14T15:42:10,222][INFO ][o.e.g.GatewayService     ] [Mont Tondu] recovered [63] indices into cluster_state
[2025-04-14T15:42:19,834][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g][0] marking unavailable shards as stale: [t4afzN2XT4-Or0H4pddogw]
[2025-04-14T15:42:19,835][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g][0] marking unavailable shards as stale: [6iAJlz2XRBS4YEKA7u8vIw]
[2025-04-14T15:42:28,105][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell.sensor][0] marking unavailable shards as stale: [J3sUjwb3S_6bVGvrERwQtw]
[2025-04-14T15:42:36,887][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell.notification][0] marking unavailable shards as stale: [w6C5Wrf7SOCXgKwHetUNSQ]
[2025-04-14T15:42:52,027][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g][0]]]).
[2025-04-14T15:42:55,040][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13425ms] ago, timed out [3402ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [421]
[2025-04-14T15:44:54,848][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [26625ms] ago, timed out [16616ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [8310]
[2025-04-14T15:44:54,849][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [15615ms] ago, timed out [5606ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [8343]
[2025-04-14T15:46:29,733][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g][0]]]).
[2025-04-14T15:46:55,423][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13412ms] ago, timed out [3403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [16462]
[2025-04-14T16:01:42,208][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13212ms] ago, timed out [3203ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29224]
[2025-04-14T16:03:04,640][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [13412ms] ago, timed out [3403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29417]
[2025-04-14T16:03:46,880][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [26825ms] ago, timed out [16815ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29441]
[2025-04-14T16:03:46,880][WARN ][o.e.t.TransportService   ] [Mont Tondu] Received response for a request that has timed out, sent [15814ms] ago, timed out [5805ms] ago, action [internal:coordination/fault_detection/follower_check], node [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200}], id [29443]
[2025-04-14T16:05:22,867][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] updating number_of_replicas to [0] for indices [ocell.user, ocell..partitioned.sensor_location.04732dhn68qj6c9i60o30c1g, ocell.data_source_type, ocell..partitioned.message.04732d9n6ss36dho60o30c1g, ocell.housing, ocell..partitioned.sensor_location.04732dhg74q3ae9i60o30c1g, ocell.synthetic_indicator_source, ocell..partitioned.indicators.04732dhn68qj6c9i60o30c1g, ocell..partitioned.message.04732d1i60o3ec1k60o30c1g, ocell..partitioned.indicators.04732cpi6kpjedhg60o30c1g, ocell.camping, ocell..partitioned.message.04732dpg6go3cdpi60o30c1g, ocell.area, ocell.events_log, ocell..partitioned.message.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732cpo70qj6d1k60o30c1g, ocell.gateway_status, ocell.gateway_type, ocell..partitioned.sensor_location.04732d9n6ss36dho60o30c1g, ocell.monthly_report, ocell.sensor, ocell.service_type, ocell..partitioned.indicators.04732dpj6kr3ge9m60o30c1g, ocell.pms_data, ocell.notification_comment, ocell.notification, ocell..partitioned.indicators.04732dhg74q3ae9i60o30c1g, ocell..partitioned.indicators.04732d9h6grjcd1o60o30c1g, ocell..partitioned.indicators.04732d1l64r30dhk60o30c1g, ocell..partitioned.sensor_location.04732d9h6grjcd1o60o30c1g, ocell..partitioned.message.04732dhn68qj6c9i60o30c1g, ocell..partitioned.indicators.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732dpg6go3cdpi60o30c1g, ocell..partitioned.indicators.04732d9k6opj0c1o60o30c1g, ocell..partitioned.message.04732dhk60sjid9i60o30c1g, ocell..partitioned.sensor_location.04732dpj6kr3ge9m60o30c1g, ocell.user_login, ocell..partitioned.indicators.04732dhk60sjid9i60o30c1g, ocell.location, ocell..partitioned.message.04732dpj6kr3ge9m60o30c1g, ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g, ocell.data_source, ocell..partitioned.message.04732dhg74q3ae9i60o30c1g, ocell..partitioned.message.04732cpo70qj6d1k60o30c1g, ocell.user_group, ocell.site, ocell..partitioned.message.04732d1o6cp34e1o60o30c1g, ocell.sms_log, ocell.gateway, ocell..partitioned.sensor_location.04732d1o6cp34e1o60o30c1g, ocell..partitioned.sensor_location.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.0472qdpl6spjgchk60o30c1g, ocell.api_token, ocell..partitioned.message.04732d1l64r30dhk60o30c1g, ocell.indicator_source, ocell.energy_conversion_factor, ocell.provider_outage, ocell..partitioned.sensor_location.04732dhk60sjid9i60o30c1g, ocell..partitioned.message.04732d9k6opj0c1o60o30c1g, ocell..partitioned.indicators.04732d9n6ss36dho60o30c1g, ocell.message_type, ocell..partitioned.indicators.04732d1i60o3ec1k60o30c1g, ocell.info_client]
[2025-04-14T16:05:22,872][INFO ][o.e.c.r.a.AllocationService] [Mont Tondu] Cluster health status changed from [YELLOW] to [GREEN] (reason: [{Männliflue}{IS3cAjDDR_6o5kpi0RBEzA}{WfEG3gvGRqioIw-KN-CYdg}{192.168.2.1}{192.168.2.1:4300}{dm}{http_address=192.168.2.1:4200} followers check retry count exceeded]).
[2025-04-14T16:05:43,968][WARN ][o.e.c.r.a.AllocationService] [Mont Tondu] [ocell..partitioned.message.04732dhg74q3ae9i60o30c1g][0] marking unavailable shards as stale: [FONfliKKQAaPDQ5CQvY1iA]

But after only a few minutes, I start seeing these messages on the other node:

[2025-04-14T16:05:37,258][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], current []}, term: 21, version: 151439, reason: becoming candidate: onLeaderFailure
[2025-04-14T16:05:39,407][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151444, reason: ApplyCommitRequest{term=21, version=151444, sourceNode={Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}}
[2025-04-14T16:05:53,935][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [26618ms] ago, timed out [16611ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3506262]
[2025-04-14T16:05:53,935][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [16211ms] ago, timed out [6204ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3506430]
[2025-04-14T16:12:03,861][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [3402ms] ago, timed out [400ms] ago, action [internal:crate:sql/sys/nodes], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3507236]

[2025-04-14T16:23:31,040][INFO ][o.e.c.c.Coordinator      ] [Männliflue] master node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:267) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:520) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutResponseHandler.handleException(TransportService.java:1020) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:928) ~[crate-server-5.6.5.jar:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [Mont Tondu][192.168.2.12:4300][internal:coordination/fault_detection/leader_check] request_id [3508075] timed out after [10007ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:929) ~[crate-server-5.6.5.jar:?]
        ... 3 more
[2025-04-14T16:23:31,041][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], current []}, term: 21, version: 151579, reason: becoming candidate: onLeaderFailure
[2025-04-14T16:23:32,098][INFO ][o.e.c.s.ClusterApplierService] [Männliflue] master node changed {previous [], current [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}]}, term: 21, version: 151580, reason: ApplyCommitRequest{term=21, version=151580, sourceNode={Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}}
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [52835ms] ago, timed out [42828ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508073]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [41828ms] ago, timed out [31822ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508074]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [30821ms] ago, timed out [20814ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508075]
[2025-04-14T16:23:51,695][WARN ][o.e.t.TransportService   ] [Männliflue] Received response for a request that has timed out, sent [20614ms] ago, timed out [10607ms] ago, action [internal:coordination/fault_detection/leader_check], node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}], id [3508082]
[2025-04-14T16:24:22,201][INFO ][o.e.c.c.Coordinator      ] [Männliflue] master node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{Mont Tondu}{rjcyAsKBQXOJo8JkfVygeg}{J-HNIfwrTlule6sbilMK6Q}{192.168.2.12}{192.168.2.12:4300}{dm}{http_address=192.168.2.12:4200}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:267) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:520) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TransportService$TimeoutResponseHandler.handleException(TransportService.java:1020) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:287) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutor.execute(EsExecutors.java:160) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:282) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:274) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:131) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:96) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:675) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:143) ~[crate-server-5.6.5.jar:?] 
        at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:118) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:83) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71) ~[crate-server-5.6.5.jar:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[netty-handler-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.104.Final.jar:4.1.104.Final]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

It looks as if the network were dropping packets … but as I’ve stated, the two machines are directly linked by an Ethernet cable, and on both hosts I see no warnings or errors in dmesg about networking issues (hardware or software) …

Also, when I check the web interface everything appears to be OK (the cluster is green), but some of my services get stuck while inserting data … I need to restart one node for the cluster to get back to normal, until it fails again after a while.
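For what it’s worth, the values in the warnings above (the [10007ms] timeout and the “failed [3] consecutive checks”) look like the default fault-detection behaviour (10 s timeout, 3 retries). If CrateDB exposes the same Elasticsearch-derived settings (I haven’t verified that these exact setting names apply to CrateDB), I suppose I could relax them in crate.yml as a workaround, roughly like this:

# Assumed setting names (copied from Elasticsearch); values are only an example
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 5
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 5

But that would only hide the symptom, so I’d rather understand why the checks are timing out in the first place.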

Does anyone have a clue about what could be causing this?
Thanks

Hi Charles, nothing in particular comes to mind, but since this is happening in your test environment and version 5.6.5 is a bit dated now, my suggestion would be to check whether the issue persists after upgrading to one of the latest versions. Would that be possible for you?

Hi Hernan,
thanks for the reply. I tried upgrading to 5.9.10 a few weeks ago, but I suspected the new version was using more memory than 5.6.x, so I had to revert to 5.6.x because I did not have the time to do precise testing and roll that version out to production as well (I have to run the same version in both the test and production environments).
Do you have any information about a change in memory footprint between 5.6.x and 5.9.x?
Thanks.

There was nothing that would increase memory usage; quite the opposite, there were plenty of optimizations that should bring better performance. Could you try 5.10.4 instead of 5.9.10?

I could try … I thought I had read something about a JOIN performance optimization that could cause a performance decrease on tables created with earlier versions of CrateDB, and that would require re-creating those tables for it to work properly, but it seems I cannot find it anymore. Do you have any information about this?