Allocating shards fails when setting routing.allocation.require._name

While building a script to shrink the number of shards of all partitions of a table on a multi-node cluster, I change the allocation of shards to a specific node by running:

ALTER TABLE <table_name>
PARTITION (edate = <partition_to_change>)
SET ("routing.allocation.require._name" = <node_allocate_shards>, "blocks.write" = true);

I get the following explanation: "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance", and the following decision: "the shard cannot be allocated to the same node on which a copy of the shard already exists".
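For reference, explanations like these can be read per shard from sys.allocations; a rough sketch, with the table name as a placeholder:

SELECT shard_id, node_id, current_state, explanation
FROM sys.allocations
WHERE table_name = '<table_name>';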

The SHOW CREATE TABLE output is:

CREATE TABLE IF NOT EXISTS "my_schema"."table1" (
   "timestamp" TIMESTAMP WITH TIME ZONE,
   "edate" TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('week', "timestamp"),
)
CLUSTERED INTO 6 SHARDS
PARTITIONED BY ("edate")
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'dynamic',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

This happens with both 4 and 2 data nodes in the cluster.
I couldn't find a solution for this; can anyone point me in the right direction? I can provide more info if needed.

number_of_replicas is set to '0-1', which auto-expands the replica count with the number of data nodes: as soon as there are at least two data nodes, there are 2 copies of each shard (primary + 1 replica), and the two copies always have to be on two separate nodes. If you require a single node to allocate all shards, this will fail and lead to:

the shard cannot be allocated to the same node on which a copy of the shard already exists
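To double-check how many physical copies of each shard actually exist per partition, you could count primaries and replicas in sys.shards; a rough sketch with a placeholder table name:

SELECT partition_ident, "primary", count(*) AS copies
FROM sys.shards
WHERE table_name = '<table_name>'
GROUP BY partition_ident, "primary";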

So you can either temporarily disable replicas or allow at least two nodes in the allocation; both options are sketched below. Disabling replicas for the partition:

ALTER TABLE <table_name>
PARTITION (edate = <partition_to_change>)
SET ("number_of_replicas" = 0);

I found the problem's root cause; it was not related to the number of replicas. As I mentioned, the script ran a query to allocate shards to a particular node for a particular partition, and then executed the following query to check whether all the shards were allocated to the same node:

SELECT distinct node_id 
FROM sys.allocations 
WHERE partition_ident = '<partition_to_change>'

If more than one node was returned, it would wait and check again.

It turns out that if you have multiple partitioned tables in the same cluster, partitions of different tables can share the same partition_ident. The solution was therefore to add AND table_name = '<table_name>' to the WHERE clause, as shown below.
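For reference, the corrected check then looks like this:

SELECT DISTINCT node_id
FROM sys.allocations
WHERE partition_ident = '<partition_to_change>'
  AND table_name = '<table_name>';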
With this conclusion, I have two questions:

  1. Is there a better way to check for pending tasks, such as node allocation, like in Elasticsearch?

  2. Wouldn't it make more sense if partition_ident were unique at the cluster level, not only at the table level?

Nevertheless, thank you for the help, @proddata!

  1. Is there a better way to check for pending tasks, such as node allocation, like in Elasticsearch?

Maybe you could check state and routing_state in sys.shards?
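For example, something along these lines would show whether any shards of the partition are still initializing or relocating (placeholders as in the earlier queries):

SELECT state, routing_state, count(*) AS shards
FROM sys.shards
WHERE table_name = '<table_name>'
  AND partition_ident = '<partition_to_change>'
GROUP BY state, routing_state;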
