Unstable cluster with 5.3+ versions

Hello,

I am running a 3-node CrateDB 5.2.10 cluster and want to upgrade to the latest version. The procedure I have followed is:
1.- Disable the periodic data-processing job to reduce system load to a minimum. Some background insertion load still exists (2 or 3 queries per second).
2.- Update node by node to 5.3.7, restarting each node after the update, so that the whole cluster ends up on 5.3.7.
3.- Update to 5.4.6 node by node (same as step 2).
4.- Update to 5.5.1 node by node (same as step 2).

After this, the result is a 5.5.1 cluster and no inserted data has been lost during the process (remember, 2-3 inserts per second).
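For reference, between restarts a couple of queries against the built-in sys tables can be used to confirm that every node is on the expected version and that all tables are green again before moving on (shown here purely as an illustration):

-- Version reported by each node
SELECT name, version['number'] AS crate_version FROM sys.nodes ORDER BY name;

-- Any table that is not GREEN yet
SELECT table_name, health, missing_shards, underreplicated_shards
FROM sys.health
WHERE health <> 'GREEN';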

The last step is to re-enable the processing job (it has a long pending queue), and it starts issuing queries (inserts and selects), about 10-15 per second.

The problem starts about 20 seconds after enabling this processing job: a random node gets disconnected for a short period of time and the cluster starts re-synchronizing. The cluster status becomes yellow and turns green again after 20-30 seconds. Then, after around another 20 seconds, the same thing happens: a node gets disconnected, the cluster re-synchronizes, and the health status goes back to green.
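While this is happening, the re-synchronization is also visible from SQL; for example, a query along these lines against sys.shards (just an illustrative way of watching it) lists the shards that are still recovering:

SELECT table_name, id AS shard_id, "primary", state
FROM sys.shards
WHERE state <> 'STARTED';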

If I stop the processing job, the cluster becomes stable again and the background inserts are processed correctly. It looks like the cluster becomes unstable under a certain amount of load.

The fact is that with version 5.2.10 everything works as expected; I don't see this behavior.

I have done some testing and it happens with the 5.3, 5.4 and 5.5 series. With 5.2 it works correctly.

These are the log entries I can see on the disconnected node (5.5.1):

[2023-12-04T17:06:27,567][WARN ][i.c.e.d.u.TransportShardUpsertAction] [crate-bemp-2] [[bemp_basa.ts_real][1]] failed to perform internal:crate:sql/data/write on replica [bemp_basa.ts_real][1], node[qul8gyHXSk--8eq3WNc6pQ], [R], s[STARTED], a[id=28qfsK2kTymJ8MUjo2_yuQ]
org.elasticsearch.transport.SendRequestTransportException: [crate-bemp-1][10.0.103.231:4300][internal:crate:sql/data/write[r]]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:633) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:486) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1035) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:248) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:96) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:74) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:267) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:196) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:152) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:240) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:98) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:53) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:927) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:126) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:428) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:346) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2721) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:127) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:292) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:242) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2695) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:868) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:342) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:308) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:59) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:682) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')
	at io.crate.types.StringType.writeValueTo(StringType.java:66) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.upsert.ShardUpsertRequest$Item.writeTo(ShardUpsertRequest.java:480) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.upsert.ShardUpsertRequest.writeTo(ShardUpsertRequest.java:182) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.writeTo(TransportReplicationAction.java:1107) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.writeTo(TransportReplicationAction.java:1148) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:90) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:67) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:159) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:104) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:249) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:616) ~[crate-server-5.5.1.jar:?]
	... 34 more
[2023-12-04T17:06:27,567][WARN ][i.c.e.d.u.TransportShardUpsertAction] [crate-bemp-2] [[bemp_basa.ts_real][2]] failed to perform internal:crate:sql/data/write on replica [bemp_basa.ts_real][2], node[nXTsFovbSVycjvBlUj47rg], [R], s[STARTED], a[id=tvevWkQCRSqm2C7SR44tVA]
org.elasticsearch.transport.SendRequestTransportException: [crate-bemp-0][10.0.103.47:4300][internal:crate:sql/data/write[r]]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:633) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:486) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1035) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:248) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:96) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.RetryableAction.run(RetryableAction.java:74) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:267) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:196) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:152) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:240) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:98) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.TransportShardAction.shardOperationOnPrimary(TransportShardAction.java:53) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:927) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:126) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:428) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:346) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:99) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2721) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:127) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:292) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:242) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2695) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:868) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:342) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:308) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:59) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:682) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.5.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')
	at io.crate.types.StringType.writeValueTo(StringType.java:66) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.upsert.ShardUpsertRequest$Item.writeTo(ShardUpsertRequest.java:480) ~[crate-server-5.5.1.jar:?]
	at io.crate.execution.dml.upsert.ShardUpsertRequest.writeTo(ShardUpsertRequest.java:182) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.writeTo(TransportReplicationAction.java:1107) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.writeTo(TransportReplicationAction.java:1148) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:90) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:67) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:159) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:104) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:249) ~[crate-server-5.5.1.jar:?]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:616) ~[crate-server-5.5.1.jar:?]
	... 34 more

After the cluster update, the only solution I have found is to downgrade again to 5.2.10 and restore a data backup.
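(For anyone wondering about the restore step, a snapshot-based restore looks roughly like this; the repository and snapshot names are placeholders.)

RESTORE SNAPSHOT backup_repo.pre_upgrade ALL WITH (wait_for_completion = true);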

Each node has 12 GB of heap memory configured and 8 CPUs. As I said, the cluster is under light load.
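These values can be double-checked per node with something like the following (heap is reported in bytes; just an example against sys.nodes):

SELECT name, heap['max'] AS heap_max_bytes, os_info['available_processors'] AS cpus
FROM sys.nodes
ORDER BY name;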

Any idea of what is happening?

Best regards,

Could you maybe provide a bit more detail regarding what queries you are running?

Data is inserted by sensors at random times, some every minute, others every 5 minutes. The resulting load is 2-3 insert queries per second. The inserts look like:

INSERT INTO ts_real_raw (real_id, date, value) VALUES (5, '2023-12-05 12:32:05', 5.56)

The data tables look like:

CREATE TABLE IF NOT EXISTS "bemp_basa"."ts_real_raw" (
   "real_id" BIGINT NOT NULL,
   "date" TIMESTAMP WITHOUT TIME ZONE NOT NULL,
   "value" DOUBLE PRECISION,
   "value_text" TEXT,
   "modification_date" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS _cast(_cast(current_timestamp, 'timestamp without time zone'), 'timestamp without time zone'),
   "year" INTEGER GENERATED ALWAYS AS EXTRACT(YEAR FROM "date") NOT NULL,
   PRIMARY KEY ("real_id", "date", "year")
)
CLUSTERED INTO 4 SHARDS

The column value_text for the moment is always null.

Every time new data arrives, the processing job simply runs a query like:

SELECT date, value FROM "bemp_basa"."ts_real_raw" WHERE real_id = 5 AND date < '2023-12-05 12:32:05' AND date >= '2023-12-05 11:32:05'

Then it makes some calculations and saves the result, using an INSERT statement, in a table (for example "bemp_basa"."ts_real") defined exactly like the ts_real_raw table.
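The write-back itself is a plain insert; for illustration (the values here are made up), something like:

INSERT INTO "bemp_basa"."ts_real" (real_id, date, value) VALUES (5, '2023-12-05 12:32:05', 123.45);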

This is the query load right now with the cluster running 5.2.10:

And this is yesterday's load, where you can see the queue-processing spike at around 20:00:

During my tests I emptied the destination table and could see some data being inserted before the re-synchronizing loop started, so the cluster does work for a few seconds.

With version 5.2.10 everything works as expected. I also tried recreating the destination tables (DROP and CREATE TABLE) so they would be created on the new version, but I still get the same error.

I hope this explanation is clear enough, since it is quite a strange and hard-to-explain problem. I can provide further clarification if needed.

This looks like a bug. Could you create an issue on GitHub, please?
Also, could you verify the issue when using a fresh 5.5 cluster (not an upgraded one)?
Adding reproducible steps to the issue, if possible, would help us a lot.
Thanks!

I have been able to reproduce the error on a clean 5.5.1 cluster locally using docker-compose. I have opened an issue on GitHub with the steps and code to reproduce the error.
