What version of CrateDB are you running? Can you please share the CREATE TABLE statement for the affected table with us (SHOW CREATE TABLE)? Is the node in good health? Did you try restarting the node?
When you say the ALTER CLUSTER/TABLE commands do not work, what do you mean? What is the output of these queries? Do you get an error?
I use 4.8.1.
Yes, I restarted everything multiple times. All nodes are in good health: no watermarks or disk problems, no RAM problems, no network problems.
I shortened the output of SHOW CREATE TABLE because I don't think our fields are the problem:
Does number_of_replicas = '0' also hold for all partitions?
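You can check this per partition with a query against `information_schema.table_partitions`, e.g. something like this (the table name is a placeholder, adjust to yours):

```sql
-- Show the replica setting of every partition of the affected table
SELECT table_name,
       partition_ident,
       "values",
       number_of_replicas
FROM information_schema.table_partitions
WHERE table_name = 'my_table';
```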
If yes, then this would explain why REROUTE PROMOTE REPLICA does not work: there is no replica to promote. I assume REROUTE RETRY FAILED does not work due to the shard being corrupt.
We recommend having at least 1 replica configured to prevent potential data loss.
Do you have a snapshot/backup of your partition that you can restore?
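If you do, restoring just the affected partition would look roughly like this (repository, snapshot, and partition column names are placeholders; the broken partition would likely need to be dropped first):

```sql
-- Restore a single partition from an existing snapshot
RESTORE SNAPSHOT my_repo.my_snapshot
  TABLE my_table PARTITION (part_col = 'some_value')
  WITH (wait_for_completion = true);
```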
How can the shard be corrupt when the Lucene check says it is not corrupt, and even CrateDB doesn't log anything that would indicate corruption?
No, I don't have a snapshot of this shard. The other 2 in this partition work flawlessly, and it occurred all of a sudden. It resides on a filesystem that is protected against corruption, so it's not a single disk failure or anything like that.
OK, so there is no chance to bring the shard back to life, even when the Lucene check finds no error?
Can I maybe find more info in some deeper log? Or can I force-allocate the shard somehow?
@jayeff all of a sudden 3 more shards in the same index got corrupt, without any error in the logs or any hardware/network failure. Same as above: Lucene doesn't see any errors when I check them.
I don't know what to do anymore. Simply holding replicas just in case shards can go corrupt at any time seems a little strange. We also run some bigger Elasticsearch/OpenSearch clusters on the same hardware base, and we never saw anything like this.
The number of replicas does not change by itself. Maybe the value was changed at some point with ALTER TABLE SET.
Unfortunately the affected partition seems to be one without a replica.
I would recommend updating all existing partitions that currently have number_of_replicas set to 0 to at least 1 replica, as shown below.
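Roughly like this (table and partition identifiers are placeholders):

```sql
-- Applies to the table schema and all existing partitions
ALTER TABLE my_table SET (number_of_replicas = 1);

-- Or target one specific partition only
ALTER TABLE my_table PARTITION (part_col = 'some_value')
  SET (number_of_replicas = 1);
```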
I'm afraid it won't be possible to bring this shard back. Force allocation would be done via REROUTE RETRY FAILED or REROUTE PROMOTE REPLICA, which did not work in your case.
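For reference, these are the statements I mean (shard id and node name are placeholders):

```sql
-- Retry allocation of shards that failed too many times
ALTER CLUSTER REROUTE RETRY FAILED;

-- Force-promote a stale copy to primary, accepting possible data loss
ALTER TABLE my_table REROUTE PROMOTE REPLICA SHARD 0 ON 'node-1'
  WITH (accept_data_loss = TRUE);
```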
Really sorry to hear that. We have never seen such behaviour in the CrateDB clusters we host, and I don't have an explanation for what is going on.
That said, shards in CrateDB don't simply go corrupt at random, so I believe something else must be afoot.
My guess would be something related to hardware or networking (though it is strange that other clusters on the same hardware seemingly aren't affected). Did you maybe recently roll out new software which changed how your CrateDB cluster is used (different reads, inserts, updates, deletes)? Is there anything in your monitoring pointing to an issue? Is it possible that your cluster is overloaded?
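One more place worth checking is the allocation explanation the cluster itself provides, e.g. with a query like:

```sql
-- Why are shards not in the STARTED state?
SELECT table_name, shard_id, current_state, explanation
FROM sys.allocations
WHERE current_state <> 'STARTED';
```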
For CrateDB clusters running on CrateDB Cloud I could give more details, as it includes our monitoring, logging, 24x7 alerting, support, backups, etc. It's difficult to diagnose such outlier issues from afar without this.
I never tried this before, so I cannot say whether it would work. If you try it out, I would suggest testing it in a dev environment beforehand, and I recommend taking a full backup in case anything goes awry.
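For the full backup, something along these lines should do (repository type, location, and names are placeholders; adjust to your setup):

```sql
-- One-off filesystem repository plus a snapshot of everything
CREATE REPOSITORY backup_repo TYPE fs WITH (location = '/backups/crate');

CREATE SNAPSHOT backup_repo.before_experiment ALL
  WITH (wait_for_completion = true);
```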