Cannot allocate because all found copies of the shard are either stale or corrupt

gruselglatz · July 12, 2022, 1:15pm

All of a sudden, one primary can't be allocated anymore.

select * FROM sys.allocations WHERE partition_ident = '082j4c1i6813e' limit 100;
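
For readers following along, a narrower variant of the query above, as a sketch. It assumes the `sys.allocations` columns documented for recent CrateDB versions (`shard_id`, `primary`, `current_state`, `explanation`, `decisions`); the partition ident is the one from this post. The `explanation` column is likely where a message like the one in the thread title shows up.

```sql
-- Show only the unassigned shards of the affected partition,
-- together with the allocator's reasoning.
SELECT shard_id, "primary", current_state, explanation, decisions
FROM sys.allocations
WHERE partition_ident = '082j4c1i6813e'
  AND current_state = 'UNASSIGNED';
```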

/usr/share/crate/lib# /usr/share/crate/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/crate_data/nodes/0/indices/eMFzNmvSQ1y9WA3V7jAdrA/2/index/

Gives me:

No problems were detected with this index.

Took 1336.002 sec total.

ALTER CLUSTER REROUTE RETRY FAILED;

Doesn't work.

ALTER TABLE "dadata" PARTITION (monthreceived = '7', yearreceived = '2022') REROUTE PROMOTE REPLICA SHARD 2 ON 'dacrate02' WITH (accept_data_loss = TRUE);

Doesn't work either.

I don't know what else to try. How can I force-allocate the shard, like with Elasticsearch? I don't know why it lost this shard; nothing is logged, even at debug level…

I hope someone can help me, thx.

jayeff · July 12, 2022, 2:08pm

Hi @gruselglatz,

what version of CrateDB are you running? Can you please share the CREATE TABLE statement for the affected table with us (SHOW CREATE TABLE)? Is the node in good health? Did you try to restart the node?

When you say the ALTER CLUSTER/TABLE commands do not work, what do you mean? What is the output of these queries? Do you get an error?

gruselglatz · July 12, 2022, 2:12pm

Hi @jayeff !

I use 4.8.1
Yes, I restarted everything multiple times. All nodes are in good health: no watermarks or disk problems, no RAM problems, no network problems.

I shortened the output of SHOW CREATE TABLE because I don't think our fields are the problem:

CREATE TABLE IF NOT EXISTS "doc"."dadata" (
...
)
CLUSTERED INTO 3 SHARDS
PARTITIONED BY ("yearreceived", "monthreceived")
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'best_compression',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0',
   refresh_interval = 1000,
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

The ALTER commands give:

ALTER OK, 1 record affected (0.197 seconds)

Without any visible effect or running job.

jayeff · July 12, 2022, 3:05pm

Does number_of_replicas = '0' also hold for all partitions?

If yes, then this will be the reason why REROUTE PROMOTE REPLICA does not work: there is no replica to promote. I assume REROUTE RETRY FAILED does not work because the shard is corrupt.

We recommend having at least 1 replica configured to prevent potential data loss.

Do you have a snapshot/backup of your partition that you can restore?

gruselglatz · July 12, 2022, 3:10pm

How do I check whether it's set on a specific partition?

How can the shard be corrupt when the Lucene check said it is not corrupt? And CrateDB doesn't even log anything that would indicate it is corrupt.

No, I don't have a snapshot of this shard. The other 2 in this partition work flawlessly, and it occurred all of a sudden. It resides on a filesystem which is protected against corruption, so not a single disk or something.

jayeff · July 13, 2022, 9:29am

With the following query you can inspect all partitions of a table:

select table_name, partition_ident, number_of_replicas from information_schema.table_partitions where table_name = '<table_name>';
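
To narrow this down to the partitions that have no replica at all, a filter on `number_of_replicas` may help. This is only a sketch: it assumes the column is stored as text (hence the quoted '0') and uses the `dadata` table from this thread.

```sql
-- List only the partitions configured without any replica.
-- "values" is quoted because it is a keyword; it holds the partition column values.
SELECT table_schema, table_name, partition_ident, "values", number_of_replicas
FROM information_schema.table_partitions
WHERE table_name = 'dadata'
  AND number_of_replicas = '0';
```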

I can only speculate as I don’t have insight. Maybe some node failure caused the staleness/corruption.

I’m sorry that I can’t be of more help.

gruselglatz · July 13, 2022, 9:32am

Ok Thanks!

The number of replicas differs across the partitions. Maybe some change in the team structure caused this.

OK, so there is no chance to bring the shard back to life, even when the Lucene check finds no error?
Can I maybe find more info in some deeper log? Or can I force-allocate the shard somehow?

gruselglatz · July 13, 2022, 10:40am

@jayeff all of a sudden 3 more shards in the same index became corrupt, without any error in the logs or any hardware/network failure. Same as above: Lucene doesn't see any errors when I check them.

I don't know what to do anymore. Keeping replicas only because shards can go corrupt at any time seems a little strange. We also run some bigger Elasticsearch/OpenSearch clusters on the same hardware base, and we never saw things like that.

jayeff · July 13, 2022, 11:51am

Number of replicas does not change by itself. Maybe the value was changed at some point with ALTER TABLE SET.

Unfortunately, the affected partition seems to be one without a replica.

I would recommend updating the existing partitions that currently have zero replicas to 1 replica.
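
As a sketch of what that could look like, using the table and partition columns from this thread (the partition values are only an illustration, and their types must match the table definition):

```sql
-- Raise the replica count for a single existing partition.
ALTER TABLE "doc"."dadata" PARTITION (yearreceived = '2022', monthreceived = '7')
  SET (number_of_replicas = 1);

-- Or set it on the table itself, which should apply to the
-- existing partitions as well as to newly created ones.
ALTER TABLE "doc"."dadata" SET (number_of_replicas = 1);
```

A replica can only be built from a healthy primary, so this protects the shards that are still running but will not recover one that is already unassigned.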

I’m afraid that it won’t be possible to bring this shard back. Force allocation would be done by REROUTE RETRY FAILED or REROUTE PROMOTE REPLICA, which did not work in your case.

Really sorry to hear. We never saw such a behaviour in the CrateDB clusters we host and I don’t have an explanation what is going on.

That said, shards in CrateDB don’t simply go corrupt at random, so I believe something else must be afoot.

My guess would be something related to hardware or networking (but it is strange that other clusters on the same hardware seemingly aren’t affected). Did you maybe recently roll out new software which changed how your CrateDB cluster is used (different reads, inserts, updates, deletes)? Is there anything in your monitoring pointing to an issue? Is it possible that your cluster is overloaded?

For CrateDB clusters running on CrateDB Cloud I could give more details, as it includes our monitoring, logging, 24x7 alerting, support, backups, etc. It’s difficult to diagnose such outlier issues from afar without this.

gruselglatz · July 13, 2022, 3:06pm

OK, so is it possible to drop only a specific shard then?

jayeff · July 14, 2022, 7:47am

Sorry, but this isn’t supported by CrateDB.

gruselglatz · July 14, 2022, 8:03am

OK, thanks. Is there a way to get rid of the files and initialize an empty shard?

jayeff · July 14, 2022, 8:24am

I have never tried this, so I cannot say whether it would work. If you try it out, I would suggest testing it in a dev environment beforehand and taking a full backup in case anything goes awry.
