Map failed Exception resulting in node crash

Hi Team,

I upgraded to CrateDB 5.10.9 last week.
My application with this CrateDB configuration was working fine for almost a week.
Then I suddenly started getting the exception below, and this memory error took my node down.

Exception:

INFO: Using MemorySegmentIndexInput and native madvise support with Java 21 or later; to disable start with  
-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false 

 master node changed
[partitioned.transactions.04732dpk70ojidpo60o30c1g][7] marking and sending shard failed due to [failed recovery] 

Caused by: java.io.IOException: Map failed: MemorySegmentIndexInput(path="/data1/crate/nodes/0/indices/zzBD9NTASEC1xlAhPjf6sw/7/index/_3c80_CrateDBLucene90_66.dvd") [this may be caused by lack of enough unfragmented virtual address space or too restrictive virtual memory limits enforced by the operating system, preventing us to map a chunk of 1015301 bytes. Please review 'ulimit -v', 'ulimit -m' (both should return 'unlimited'), and 'sysctl vm.max_map_count'. 


OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007e1b410d6000, 16384, 0) failed; error='Not enough space' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.

My cluster setup:
Heap size = 30 GB (as recommended)
Number of instances = 5
Total records = approx. 9 billion
Number of shards = not more than 1000

After trying multiple options to get rid of it, only purging old data got my cluster up and running again.

Has anyone else faced a similar issue?
What could be a possible solution for it?

Hi, did you try to review:

‘ulimit -v’, ‘ulimit -m’ (both should return ‘unlimited’), and ‘sysctl vm.max_map_count’.
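
Just as a reference, these are standard Linux commands you can run directly on each node (ideally as the same user that runs the CrateDB process, since ulimits are per user/session):

# Per-process memory limits for the user running CrateDB
# (both should ideally report 'unlimited')
ulimit -v
ulimit -m

# Kernel limit on the number of memory map areas per process
sysctl vm.max_map_count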

Yes, I tried these options and set the values to 200000, not unlimited.
max_map_count is also set to 200000.
It still didn’t work.

If I read this right, you tried 200k and not unlimited?

I would recommend to:

Double max_map_count

If you have 100k, double it to 200k; if the issue still persists, double it again to 400k, and so on up to 1M, or set it to unlimited.
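
As a minimal sketch of how to apply that on a typical Linux node (the 400000 value and the 99-cratedb.conf file name are just examples, adjust them to your environment):

# Raise the limit immediately (lost on reboot)
sudo sysctl -w vm.max_map_count=400000

# Persist the setting across reboots
echo "vm.max_map_count = 400000" | sudo tee /etc/sysctl.d/99-cratedb.conf
sudo sysctl --system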

Additionally, just in case, I would also recommend the following.

Check your shard count

We recommend that the shard count does not exceed 1000 shards per node. You say it is no more than 1000, but that could also be 999 or 1000; I would try adding one more node and see if the issue persists. You can check how your shards are distributed with the query below.
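
For example, you can see how many shards each node currently holds by querying the sys.shards system table:

-- Number of shards currently allocated per node
SELECT
  node['name'] AS node_name,
  count(*) AS shard_count
FROM
  sys.shards
GROUP BY
  node['name']
ORDER BY
  shard_count DESC;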

Check shard size

We recommend shard sizes between 5 and 50 GiB; you can check your shard sizes with:

-- Shard sizes in GiB, largest shards first
SELECT
  schema_name,
  table_name,
  id,
  (size / (1024.0 ^ 3)) AS size_in_gib
FROM
  sys.shards
ORDER BY
  size_in_gib DESC;