TranslogCorruptedException

versatildefuy · October 25, 2022, 7:28am

Hi guys,

I am running a one-node deployment, version 4.1.8, for a small deployment. After a reboot, I encountered the following problem:

servicesnode.1.o10owwaac4vg@b2-15-smart-city    | org.elasticsearch.indices.recovery.RecoveryFailedException: [mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2044) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.lang.Thread.run(Thread.java:830) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:347) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1809) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | Caused by: java.nio.file.NoSuchFileException: /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1804) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city    | 	... 4 more

The status of the table in the graphical interface is as follows:

150,100 Registros (34.6 MB)
50034 Underreplicated Records / 50,034 Unavailable Records / 4 Shards / 2-all Replicas

Following what I have been able to read in the documentation and similar problems, I have executed the following command:

select * from sys.allocations where current_state != 'STARTED' limit 100;

{
  "cols": [
    "current_state",
    "decisions",
    "explanation",
    "node_id",
    "partition_ident",
    "primary",
    "shard_id",
    "table_name",
    "table_schema"
  ],
  "col_types": [
    4,
    [
      100,
      12
    ],
    4,
    4,
    4,
    3,
    9,
    4,
    4
  ],
  "rows": [
    [
      "UNASSIGNED",
      [
        {
          "node_name": "servicesnode",
          "explanations": [
            "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually execute 'ALTER CLUSTER REROUTE RETRY FAILED' to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2022-10-25T07:06:47.401Z], failed_attempts[5], delayed=false, details[failed shard on node [NQOmfBYLRZ-w6g4pQagm9w]: failed recovery, failure RecoveryFailedException[[mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted]; nested: NoSuchFileException[/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog]; ], allocation_status[deciders_no]]]"
          ],
          "node_id": "NQOmfBYLRZ-w6g4pQagm9w"
        }
      ],
      "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
      null,
      null,
      true,
      0,
      "etwastecontainer",
      "mtwastemanagement"
    ]
  ],
  "rowcount": 1,
  "duration": 3.479047
}

Following the recommendation in the explanation I have executed the command:

ALTER CLUSTER REROUTE RETRY FAILED

but the error is still displayed, as shown in the log.

At this point, I don’t know what else to do, do you know of a way to fix the problem? Or alternatively, I would accept losing the percentage of unavailable records if I can keep the rest of the records.

I appreciate any help you can provide.
Best regards

smu · October 25, 2022, 9:00am

Looks like the node may have been shutdown in an unclean state, e.g. translog files were not correctly fsynced to disk, some disk failure happened, etc.

In a multi-node setup with at least 1 replica configured, CrateDB would automatically recover from a healthy copy of the shard.

There are 2 options to recover from this corrupted state, both may result in a data loss unfortunately:

a) Download Elasticsearch and use their elasticsearch-shard command line tool, see elasticsearch-shard | Elasticsearch Guide [8.4] | Elastic. This tool will try to detect and repair the translog files. You’d need to pass the full path of the index directory using the --dir options.

b) Delete all files inside /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/. This may result in bigger data loss as option a) as it will delete the whole translog and such any records not yet commited to Lucene.

versatildefuy · October 25, 2022, 10:33am

Hi @smu , thank you for your quick response.

I have tried to follow the solution a). I have installed elasticsearch and run the command, but it seems I have not followed the proper installation method or I am missing some steps. Maybe you can help me.

[root@52a7db6d10dc bin]# ./elasticsearch-shard remove-corrupted-data --dir /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ
warning: ignoring JAVA_HOME=/opt/jdk-13.0.1; using bundled JDK
------------------------------------------------------------------------

    WARNING: Elasticsearch MUST be stopped before running this tool.

-----------------------------------------------------------------------

  Please make a complete backup of your index before using this tool.

-----------------------------------------------------------------------
Exception in thread "main" org.elasticsearch.ElasticsearchException: no node meta data is found, node has not been started yet?
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.createPersistedClusterStateService(ElasticsearchNodeCommand.java:107)
	at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.processDataPaths(RemoveCorruptedShardDataCommand.java:241)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:142)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:160)
	at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:94)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.Command.main(Command.java:50)
	at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)

These are the steps I have followed for the installation.

$ rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
$ nano /etc/yum.repos.d/elasticsearch.repo
$ yum install --enablerepo=elasticsearch elasticsearch
$ cd /usr/share/elasticsearch/bin/

Thank you in advance.

smu · October 25, 2022, 2:25pm

Hi @versatildefuy,

hm right, the data directory must be passed using the -E path.data argument instead of the --dir one. Seems to me that the --dir option may not be usable, a bit strange.
So in your case the command should be:

./bin/elasticsearch-shard -E path.data=/data/data/nodes/0/ --index <INDEX_NAME> --shard 0

Be aware that you must use the index name and not the index UUID which is used for storing the data and also inside the exception.

smu · October 26, 2022, 7:38am

The --dir option can be used to pass in the index directory of the relevant shard directly, e.g. when the index name isn’t known.

./bin/elasticsearch-shard -E path.data=/data/data/nodes/0/ --dir /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/index

echarlus · September 2, 2024, 10:42am

Hi,
I’m also facing a corrupted translog issue on crate 5.6.5.
I’m trying to follow your guide here (using es 8.12.2 which seems to be the one used in this crate version since it’s bundled with lucene 9.9 - same issue with newer versions)

And I’m getting this error :

./bin/elasticsearch-shard remove-corrupted-data -E path.data=/data/crate/crate_5x/nodes/0/ --dir /data/crate/crate_5x/nodes/0/indices/7sMeLMN8QY-BhSBLN4JP1Q/0/index/
------------------------------------------------------------------------

    WARNING: Elasticsearch MUST be stopped before running this tool.

sept. 02, 2024 12:37:46 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
-----------------------------------------------------------------------

  Please make a complete backup of your index before using this tool.

-----------------------------------------------------------------------
Exception in thread "main" java.lang.IllegalArgumentException: Could not load codec 'Lucene99'. Did you forget to add lucene-backward-codecs.jar?
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:516)
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:405)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:364)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:300)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:88)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:816)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:67)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:60)
	at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:351)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.createPersistedClusterStateService(ElasticsearchNodeCommand.java:108)
	at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.processDataPaths(RemoveCorruptedShardDataCommand.java:241)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:145)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:163)
	at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:94)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.Command.main(Command.java:50)
	at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)
	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (33b38b16). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/crate/crate_5x/nodes/0/_state/segments_7ex")))
		at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:500)
		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:371)
		... 18 more
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene99' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene95, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91, Lucene92, Lucene94]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113)
	at org.apache.lucene.codecs.Codec.forName(Codec.java:118)
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:512)
	... 20 more

If I use older es versions (for example 8.11.4) I’m getting this one (which makes be believe that 8.12.2 is the version to use):

./bin/elasticsearch-shard remove-corrupted-data -E path.data=/data/crate/crate_5x/nodes/0/ --dir /data/crate/crate_5x/nodes/0/indices/7sMeLMN8QY-BhSBLN4JP1Q/0/index/
------------------------------------------------------------------------

    WARNING: Elasticsearch MUST be stopped before running this tool.

sept. 02, 2024 12:37:46 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
-----------------------------------------------------------------------

  Please make a complete backup of your index before using this tool.

-----------------------------------------------------------------------
Exception in thread "main" java.lang.IllegalArgumentException: Could not load codec 'Lucene99'. Did you forget to add lucene-backward-codecs.jar?
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:516)
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:405)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:364)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:300)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:88)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:816)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:67)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:60)
	at org.elasticsearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:351)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.createPersistedClusterStateService(ElasticsearchNodeCommand.java:108)
	at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.processDataPaths(RemoveCorruptedShardDataCommand.java:241)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:145)
	at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:163)
	at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:54)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:94)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:85)
	at org.elasticsearch.cli.Command.main(Command.java:50)
	at org.elasticsearch.launcher.CliToolLauncher.main(CliToolLauncher.java:64)
	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (33b38b16). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data/crate/crate_5x/nodes/0/_state/segments_7ex")))
		at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:500)
		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:371)
		... 18 more
Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene99' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene95, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91, Lucene92, Lucene94]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113)
	at org.apache.lucene.codecs.Codec.forName(Codec.java:118)
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:512)
	... 20 more

Any idea of how to recover my translog and clean my corrupted data ?
Thanks

smu · September 3, 2024, 7:50am

Unfortunately for this situation, we moved away from using ES as a fork since CrateDB 4.0.0 (we integrated required parts into our code base). Thus, using any ES code independent isn’t working anymore. You could try to include all CrateDB and Lucene JAR files located at <CRATE-INSTALL>/lib into the class path of the elasticsearch-cli script which the elasticsearch-shard tool is using (see elasticsearch/distribution/src/bin/elasticsearch-shard at v8.12.2 · elastic/elasticsearch · GitHub and elasticsearch/distribution/src/bin/elasticsearch-cli at v8.12.2 · elastic/elasticsearch · GitHub.
This would load the required lucene libraries.
Also ensure that you are invoking it with the same Java version the related CrateDB version uses. For CrateDB 5.6.5 use Java 21, see crate/pom.xml at 5.6.5 · crate/crate · GitHub.

Another solution would be (that’s what the elasticsearch-shard CLI is actually doing) to remove the files which are corrupted manually while the CrateDB instance isn’t running.
Of course both ways will result in a data loss.

smu · September 3, 2024, 7:56am

I’ve created a related feature request Add CLI tool to recover a corrupted shard · Issue #16561 · crate/crate · GitHub, please upvote, thanks!

echarlus · September 3, 2024, 8:21am

Thanks for creating the issue. I’ve upvoted

echarlus · September 3, 2024, 8:23am

Thanks for the explanation. I finally figured out that the whole SSD seemed to be dead and I had to restart from a production backup on a new SSD.
To avoid this kind of situation, I’ll add a second node on my test server.

Topic		Replies	Views
Crate 5.6.5 : failed engine [lucene commit failed] -> DB corrupted and not starting anymore CrateDB	2	80	September 3, 2024
What can we do when a silo is missing in a table? CrateDB	4	300	August 29, 2023
Service not start , Format version is not supported CrateDB	12	1269	January 12, 2022
Cannot allocate because all found copies of the shard are either stale or corrupt CrateDB	12	1453	July 14, 2022
Frozen Cluster with FileSystemException CrateDB	3	1467	February 18, 2019

TranslogCorruptedException

Related topics