Hi guys,
I am running a single-node CrateDB deployment, version 4.1.8, for a small project. After a reboot, I encountered the following problem:
servicesnode.1.o10owwaac4vg@b2-15-smart-city | org.elasticsearch.indices.recovery.RecoveryFailedException: [mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2044) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at java.lang.Thread.run(Thread.java:830) [?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:347) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | ... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city | Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1809) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | ... 4 more
servicesnode.1.o10owwaac4vg@b2-15-smart-city | Caused by: java.nio.file.NoSuchFileException: /data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at java.nio.channels.FileChannel.open(FileChannel.java:292) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at java.nio.channels.FileChannel.open(FileChannel.java:345) ~[?:?]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.translog.Translog.readCheckpoint(Translog.java:1804) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1796) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1319) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1282) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:426) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:303) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1553) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2040) ~[crate-app.jar:4.1.8]
servicesnode.1.o10owwaac4vg@b2-15-smart-city | ... 4 more
The status of the table in the Admin UI is as follows:
150,100 Records (34.6 MB)
50,034 Underreplicated Records / 50,034 Unavailable Records / 4 Shards / 2-all Replicas
Based on what I have read in the documentation and in reports of similar problems, I ran the following query:
SELECT * FROM sys.allocations WHERE current_state != 'STARTED' LIMIT 100;
{
"cols": [
"current_state",
"decisions",
"explanation",
"node_id",
"partition_ident",
"primary",
"shard_id",
"table_name",
"table_schema"
],
"col_types": [
4,
[
100,
12
],
4,
4,
4,
3,
9,
4,
4
],
"rows": [
[
"UNASSIGNED",
[
{
"node_name": "servicesnode",
"explanations": [
"shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually execute 'ALTER CLUSTER REROUTE RETRY FAILED' to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2022-10-25T07:06:47.401Z], failed_attempts[5], delayed=false, details[failed shard on node [NQOmfBYLRZ-w6g4pQagm9w]: failed recovery, failure RecoveryFailedException[[mtwastemanagement.etwastecontainer][0]: Recovery failed on {servicesnode}{NQOmfBYLRZ-w6g4pQagm9w}{iY4_Q7uwRuqKVOQ66ZJnhQ}{10.0.1.134}{10.0.1.134:4300}{http_address=10.0.1.134:4200}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog] is corrupted]; nested: NoSuchFileException[/data/data/nodes/0/indices/V2nTFdlwQN-berRDG4kQtQ/0/translog/translog-6537.tlog]; ], allocation_status[deciders_no]]]"
],
"node_id": "NQOmfBYLRZ-w6g4pQagm9w"
}
],
"cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
null,
null,
true,
0,
"etwastecontainer",
"mtwastemanagement"
]
],
"rowcount": 1,
"duration": 3.479047
}
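In case it is useful, I also looked at the shard state directly. This is the query I used (I am assuming sys.shards is the right view for this and that I am reading the relevant columns):

SELECT schema_name, table_name, id, "primary", state, routing_state
FROM sys.shards
WHERE schema_name = 'mtwastemanagement'
  AND table_name = 'etwastecontainer';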
Following the recommendation in that explanation, I executed:
ALTER CLUSTER REROUTE RETRY FAILED
but the recovery error from the log above is still displayed.
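To verify, after the retry I re-ran the allocation check restricted to this table (using the same sys.allocations columns as in the output above), and the primary shard is still reported as UNASSIGNED:

SELECT current_state, "primary", explanation
FROM sys.allocations
WHERE table_schema = 'mtwastemanagement'
  AND table_name = 'etwastecontainer';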
At this point I am not sure what else to try. Do you know of a way to fix the problem? Alternatively, I would accept losing the unavailable records if I can keep the rest.
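For what it is worth, the closest thing I have found in the documentation is ALTER TABLE ... REROUTE. I have not tried the command below yet, because I am not sure it is safe (or even applicable) when the corrupted copy is the primary on a single-node cluster; the shard id 0 and the node name 'servicesnode' are taken from the output above:

ALTER TABLE mtwastemanagement.etwastecontainer
REROUTE PROMOTE REPLICA SHARD 0 ON 'servicesnode'
WITH (accept_data_loss = TRUE);

If that is the wrong tool for a corrupted translog, any pointer to the right approach would be very welcome.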
I appreciate any help you can provide.
Best regards