Crate 5.6.5 : failed engine [lucene commit failed] -> DB corrupted and not starting anymore

Hi,
I’ve just run into a DB corruption error that prevents the whole DB from starting. This is CrateDB 5.6.5 on a single-node Linux machine.
I’ve been using Crate for 7 years and this is the first time something like this has happened.
Looking at the Linux logs, I found this:

Aug 30 21:45:16 crate[1633]: [2024-08-30T21:45:16,315][INFO ][o.e.c.r.a.AllocationService] [Pied Moutet] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g][0]]]).
**Aug 30 21:45:16 kernel: EXT4-fs error (device sdc1): ext4_validate_block_bitmap:423: comm cratedb[Pied Mo: bg 50: bad block bitmap checksum**
**Aug 30 21:45:16 kernel: EXT4-fs error (device sdc1) in ext4_mb_clear_bb:6551: Filesystem failed CRC**
Aug 30 21:45:26 crate[1633]: [2024-08-30T21:45:26,639][WARN ][o.e.t.ThreadPool         ] [Pied Moutet] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@3a67d4fd] on thread pool [same]
Aug 30 21:45:26 crate[1633]: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

So it appears that an SSD block got corrupted and that this is the root cause. But why does the whole DB fail after such an event?
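
For anyone triaging a similar journal, here is a minimal Python sketch that pulls the kernel EXT4-fs error lines out of syslog-style output so they can be compared with the timestamps of the Crate messages. The function name `ext4_errors` and the regular expression are my own assumptions based on the line format shown above; the script only reads text and repairs nothing.

```python
import re

# Assumed line shape, taken from the excerpt above:
# "Aug 30 21:45:16 kernel: EXT4-fs error (device sdc1): ..."
# Leading "*" characters (forum bold markup) are tolerated.
EXT4_ERROR = re.compile(
    r"^\**(?P<ts>\w{3}\s+\d+\s+[\d:]{8}) kernel: "
    r"EXT4-fs error \(device (?P<dev>[^)]+)\)"
)


def ext4_errors(lines):
    """Return (timestamp, device) pairs for every EXT4-fs kernel error line."""
    hits = []
    for line in lines:
        match = EXT4_ERROR.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("dev")))
    return hits


sample = [
    "Aug 30 21:45:16 kernel: EXT4-fs error (device sdc1): "
    "ext4_validate_block_bitmap:423: comm cratedb[Pied Mo: bg 50: "
    "bad block bitmap checksum",
    "Aug 30 21:45:16 kernel: EXT4-fs error (device sdc1) in "
    "ext4_mb_clear_bb:6551: Filesystem failed CRC",
    "Aug 30 21:45:26 crate[1633]: org.apache.lucene.store."
    "AlreadyClosedException: this IndexWriter is closed",
]

for timestamp, device in ext4_errors(sample):
    print(timestamp, device)
```

Both kernel errors here name the same device (sdc1) and carry the same second-level timestamp as the switch to RED, which is what points at the disk rather than at Crate itself.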

I’ve tried to restart the Crate instance, but now it no longer starts :frowning:
Fortunately (?) this is a test server, not the production one :fearful:
Here’s an excerpt of the logs:

[2024-08-30T21:45:15,771][WARN ][o.e.i.e.Engine           ] [Pied Moutet]  [ocell..partitioned.indicators.04732dpg6go3cdpi60o30c1g][0]failed engine [lucene commit failed]
org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=876098357 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/crate/crate_5x/nodes/0/indices/1KiOGtIvQyKzIIj-ex4QMQ/0/index/_3wu7e.si")))
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:585) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:433) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:493) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.codecs.lucene99.Lucene99SegmentInfoFormat.read(Lucene99SegmentInfoFormat.java:104) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:411) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:368) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:304) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.elasticsearch.index.engine.CombinedDeletionPolicy.getDocCountOfCommit(CombinedDeletionPolicy.java:133) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.index.engine.CombinedDeletionPolicy.onCommit(CombinedDeletionPolicy.java:105) ~[crate-server-5.6.5.jar:?]
        at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:582) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:4157) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4111) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4065) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2472) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1868) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1055) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.indices.flush.SyncedFlushService$2.doRun(SyncedFlushService.java:154) ~[crate-server-5.6.5.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.6.5.jar:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
        Suppressed: org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=876098357 vs expected header=1071082519 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/crate/crate_5x/nodes/0/indices/1KiOGtIvQyKzIIj-ex4QMQ/0/index/_3wu7e.si")))
                at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:187) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:254) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.codecs.lucene99.Lucene99SegmentInfoFormat.read(Lucene99SegmentInfoFormat.java:98) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:411) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:368) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:304) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.elasticsearch.index.engine.CombinedDeletionPolicy.getDocCountOfCommit(CombinedDeletionPolicy.java:133) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.engine.CombinedDeletionPolicy.onCommit(CombinedDeletionPolicy.java:105) ~[crate-server-5.6.5.jar:?]
                at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:582) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:4157) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4111) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4065) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2472) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1868) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1055) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.indices.flush.SyncedFlushService$2.doRun(SyncedFlushService.java:154) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server-5.6.5.jar:?]
                at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
                at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
                at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
        Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (391a9e8c). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/crate/crate_5x/nodes/0/indices/1KiOGtIvQyKzIIj-ex4QMQ/0/index/segments_26")))
                at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:501) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:375) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:304) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.elasticsearch.index.engine.CombinedDeletionPolicy.getDocCountOfCommit(CombinedDeletionPolicy.java:133) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.engine.CombinedDeletionPolicy.onCommit(CombinedDeletionPolicy.java:105) ~[crate-server-5.6.5.jar:?]
                at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:582) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:4157) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:4111) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4065) ~[lucene-core-9.9.1.jar:9.9.1 eee32cbf5e072a8c9d459c349549094230038308 - 2023-12-13 11:03:02]
                at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2472) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1868) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1055) ~[crate-server-5.6.5.jar:?]
                at org.elasticsearch.indices.flush.SyncedFlushService$2.doRun(SyncedFlushService.java:154) ~[crate-server-5.6.5.jar:?]
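
To see at a glance which files the trace above flags, a small Python sketch along these lines could group the `CorruptIndexException` messages by the file path in their `resource=...path="..."` part. The helper name `corrupt_files` and the regular expression are assumptions derived from the message format in this excerpt, not a CrateDB or Lucene tool, and the script only summarises log text.

```python
import re

# Assumed message shape, taken from the excerpt above:
# CorruptIndexException: <message> (resource=...(path="<file>")))
CORRUPT = re.compile(
    r'CorruptIndexException: (?P<msg>.*?) \(resource=.*?path="(?P<path>[^"]+)"'
)


def corrupt_files(log_text):
    """Map each file named in a CorruptIndexException to its messages."""
    found = {}
    for match in CORRUPT.finditer(log_text):
        found.setdefault(match.group("path"), []).append(match.group("msg"))
    return found


INDEX_DIR = "/data/crate/crate_5x/nodes/0/indices/1KiOGtIvQyKzIIj-ex4QMQ/0/index"
RESOURCE = '(resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="%s")))'

sample_log = "\n".join([
    "org.apache.lucene.index.CorruptIndexException: codec footer mismatch "
    "(file truncated?): actual footer=876098357 vs expected "
    "footer=-1071082520 " + RESOURCE % (INDEX_DIR + "/_3wu7e.si"),
    "Suppressed: org.apache.lucene.index.CorruptIndexException: codec header "
    "mismatch: actual header=876098357 vs expected header=1071082519 "
    + RESOURCE % (INDEX_DIR + "/_3wu7e.si"),
    "Suppressed: org.apache.lucene.index.CorruptIndexException: checksum "
    "passed (391a9e8c). possibly transient resource issue, or a Lucene or "
    "JVM bug " + RESOURCE % (INDEX_DIR + "/segments_26"),
])

for path, messages in corrupt_files(sample_log).items():
    print(path, len(messages))
```

On this excerpt, the header and footer mismatches both concern `_3wu7e.si`, while `segments_26` is only reported with a passing checksum.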

The full log from when the issue happened (08-30) and the latest log with the restart failures are attached below.
Does anyone know how to recover from this error?
Thanks

o-cell-test-2024-08-30.log (151.8 KB)
o-cell-test.log (2.6 MB)

This relates to TranslogCorruptedException; see also TranslogCorruptedException - #7 by smu.

Thanks. I had found these issues and tried to use the ES shard recovery tool, but as you explained, it’s not supported anymore :frowning:
The SSD was dead, so I had no choice but to start over from a production backup on my test server.