Hello,
I am running a Crate 5.2.10 3 node cluster and want to update to latest. The procedure I have followed is:
1.- Disable periodic data processing job to reduce system load to minimum. Still some background data insertion load exists (2 o 3 queries per second).
2.- Update node by node to 5.3.7 and restart after the update to have a 5.3.7 cluster.
3.- Update to 5.4.6 node by node (same as step 2)
4.- Update to 5.5.1 node by node (same as step 2)
After this the result is a 5.5.1 cluster and no insertion data has been lost during the process (remember, 2-3 insertions per second).
The last step is to enable again the processing job (it has a long pending queue) and it starts making queries (inserts and selects). The number of queries is about 10-15 per second.
The problem starts about 20 seconds after enabling this processing job, a random node gets disconnected for a short period of time and the cluster starts re-synchronizing. The cluster status becomes yellow and green again after 20-30 seconds. Then again after around 20 seconds the process starts again, a node gets disconnected, re-synchronizing process and again green health status.
If I stop the processing job the cluster becomes stable again and the background insertions are done correctly. It looks like under some load the cluster becomes unstable.
The fact is that with version 5.2.10 everything works as expected, I don’t see this behavior.
I have done some testing and it happens with 5.3, 5.4 and 5.5 series. With 5.2 works correctly.
This is the log entries I can see in the disconnected node (5.5.1):
[2023-12-04T17:06:27,567][WARN ][i.c.e.d.u.TransportShardUpsertAction] [crate-bemp-2] [[bemp_basa.ts_real][1]] failed to perform internal:crate:sql/data/write on replica [bemp_basa.ts_real][1], node[qul8gyHXSk--8eq3WNc6pQ], [R], s[STARTED], a[id=28qfsK2kTymJ8MUjo2_yuQ]
org.elasticsearch.transport.SendRequestTransportException: [crate-bemp-1][10.0.103.231:4300][internal:crate:sql/data/write[r]]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:633) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:486) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1035) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:248) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:96) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:74) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:267) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:196) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:152) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:240) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:98) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:53) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:927) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:126) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:428) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:346) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2721) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:127) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:292) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:242) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2695) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:868) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:342) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:308) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:59) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:682) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')
at io.crate.types.StringType.writeValueTo(StringType.java:66) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.upsert.ShardUpsertRequest$Item.writeTo(ShardUpsertRequest.java:480) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.upsert.ShardUpsertRequest.writeTo(ShardUpsertRequest.java:182) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.writeTo(TransportReplicationAction.java:1107) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.writeTo(TransportReplicationAction.java:1148) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:90) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:67) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:159) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:104) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:249) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:616) ~[crate-server-5.5.1.jar:?]
... 34 more
[2023-12-04T17:06:27,567][WARN ][i.c.e.d.u.TransportShardUpsertAction] [crate-bemp-2] [[bemp_basa.ts_real][2]] failed to perform internal:crate:sql/data/write on replica [bemp_basa.ts_real][2], node[nXTsFovbSVycjvBlUj47rg], [R], s[STARTED], a[id=tvevWkQCRSqm2C7SR44tVA]
org.elasticsearch.transport.SendRequestTransportException: [crate-bemp-0][10.0.103.47:4300][internal:crate:sql/data/write[r]]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:633) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:486) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1035) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:248) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:96) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:74) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:267) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:196) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:152) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:240) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:98) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:53) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:927) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:126) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:428) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:346) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2721) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:127) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:292) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:242) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2695) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:868) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:342) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:308) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:59) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:682) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')
at io.crate.types.StringType.writeValueTo(StringType.java:66) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.upsert.ShardUpsertRequest$Item.writeTo(ShardUpsertRequest.java:480) ~[crate-server-5.5.1.jar:?]
at io.crate.execution.dml.upsert.ShardUpsertRequest.writeTo(ShardUpsertRequest.java:182) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.writeTo(TransportReplicationAction.java:1107) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.writeTo(TransportReplicationAction.java:1148) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:90) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:67) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:159) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:104) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:249) ~[crate-server-5.5.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:616) ~[crate-server-5.5.1.jar:?]
... 34 more
After cluster update the only solution I have found is to downgrade again to 5.2.10 and restore data backup.
Each node has 12gb of heap memory configured and 8 cpus. The cluster as I said is under light load.
Any idea of what is happening?
Best regards,