CrateDB nodes going down on Ubuntu 24

Hello everyone

Our organization is currently tasked with migrating our cluster nodes from an older Ubuntu release to the latest one, staying on CrateDB 4.8.1.
Here are the actions we performed successfully:

  1. Deployed the cluster on the newer Ubuntu instances. I had to add an additional Java option, `-XX:-UseContainerSupport`, as there is a compatibility issue between the bundled JDK version and Ubuntu 24 (due to the cgroup v2 issue).

  2. I was able to copy the older tables to the newly deployed cluster using SNAPSHOT commands and run queries on them, both inserts and reads (a rough sketch of the commands is at the end of this post).

    However, the nodes are going down intermittently, with CPU utilization reaching 100% as observed in the AWS metrics. Once this happens, the entire node remains inaccessible until restarted.

    This issue was not present with the older cluster. For now we are planning to keep the CrateDB version the same. I just wanted to understand whether the configuration specified above is sustainable.

    Please let me know which logs and other details you need.

    Thanks in advance for the help.
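
For completeness, the snapshot-based copy from step 2 followed roughly this shape. Repository, bucket, and snapshot names here are placeholders rather than our real ones, and the S3 credentials/endpoint settings are omitted:

```sql
-- Sketch only: names and bucket are placeholders, credentials omitted.
-- On the old 4.8.1 cluster: register a repository and take a snapshot.
CREATE REPOSITORY migration_repo TYPE s3
  WITH (bucket = 'my-crate-snapshots', base_path = 'migration');

CREATE SNAPSHOT migration_repo.pre_migration ALL
  WITH (wait_for_completion = true);

-- On the new cluster: register the same repository, then restore.
CREATE REPOSITORY migration_repo TYPE s3
  WITH (bucket = 'my-crate-snapshots', base_path = 'migration');

RESTORE SNAPSHOT migration_repo.pre_migration ALL
  WITH (wait_for_completion = true);
```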

Hi @arnab32,

thank you for writing in, and sorry that you are running into trouble while migrating your CrateDB cluster. I think CrateDB 4.x is EOL, so we don’t support it any longer, but I am sure we can find a way to support you.

> Nodes are going down intermittently with CPU utilization reaching 100% as observed from the AWS metrics.

I think this needs further investigation; we can’t tell much from a distance. Would any of the other CrateDB support options help you in any way? Maybe @hammerhead or @karynsaz can get you onboarded on our Jira to check what kind of special services we might be able to provide for your case?

Also, the behaviour of your cluster might sound familiar to them in one way or another, so they may be able to come up with suggestions or even recommendations without further ado.

> Please let me know which logs and other details you need.

Let me also humbly defer this question to my colleagues: they know best how to start a relevant troubleshooting effort. In general, if you can spot anything suspicious in your log files, it can make sense to share it so we can evaluate from a distance.

With kind regards,
Andreas.

Hi @arnab32,

To get to the bottom of issues like the one you describe, it is very helpful to understand the full picture. For example, what type of data model you use (including sharding, partitioning, …), the characteristics of your workload, details of the environment your Ubuntu instances are running on, and more.

Log files can be interesting if they include any abnormal messages, but may not always include relevant details. Other metrics can help to complement the picture, such as JMX monitoring.

Depending on how much information you can share publicly, please elaborate a bit more on how you use CrateDB overall, what your tables look like (CREATE TABLE statements), your data volume, query patterns, node specifications, etc. If you find any traces of problems in log files, monitoring, etc., that would also be interesting. Usually we start investigations by reviewing monitoring metrics. The mentioned JMX metrics are important here, as they give us insights into how CrateDB performs internally (e.g. if there are any exhausted queues, memory pressure, …).
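
Even before full JMX coverage is in place, a few queries against the sys tables can already give a first impression. Here is a rough sketch, not a full monitoring setup; please double-check the column names against the sys.nodes documentation for your version:

```sql
-- Quick per-node overview: load, heap pressure, and disk headroom.
SELECT name,
       load['1'] AS load_1m,
       heap['used'] * 100.0 / heap['max'] AS heap_used_pct,
       fs['total']['available'] AS fs_available_bytes
FROM sys.nodes;

-- Thread pools with rejected tasks can point at exhausted queues.
SELECT name,
       thread_pools['name'] AS pools,
       thread_pools['rejected'] AS rejected
FROM sys.nodes;
```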

Getting a complete overview is often difficult in public. You can reach out to us through our website for an individual discussion of your use case and ways we can help. This can range from helping to set up full monitoring coverage to a joint effort upgrading your CrateDB version to a supported one.

Best
Niklas

Hi @hammerhead @amotl

Apologies for the late response; I was involved in multiple issues throughout the previous weeks. I would like to describe the changes we tried and some issues we identified.

Changes done:

  1. The initial assumption was that the combination of the older bundled JDK 17 and the newer Ubuntu 24 was the issue, so we started the Crate process with JDK 21 instead of the older bundled JDK. With that, the `-XX:-UseContainerSupport` flag was no longer required, and CrateDB started on all nodes across the cluster. As per CrateDB recommendations, we set the heap size to 45–50% of the total system/node memory (see the verification sketch below).
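
To double-check that the new heap setting actually took effect on every node after the JDK switch, a quick look at sys.nodes should be enough. A small sketch (values are in bytes):

```sql
-- Compare the configured heap maximum against total system memory per node.
SELECT name,
       heap['max'] AS heap_max_bytes,
       mem['used'] + mem['free'] AS system_mem_bytes
FROM sys.nodes;
```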

Issues and investigation:

  1. As before, the issue still remains: the Crate process gets killed and the EC2 nodes become unreachable at random, even without any real data present (just a test_table with 2 columns and 2 entries). The strange part is that the cluster remains stable for a few days before some random node goes down.

  2. After more investigation, we identified that the Java process was getting killed by the OS itself (OOM killer):
    Out of memory: Killed process 2532357 (java) total-vm:9125676kB, anon-rss:3951192kB (3.76 GB), file-rss:3048kB, shmem-rss:0kB, UID:1001 pgtables:10896kB oom_score_adj:0
    At the same time, multiple lsof processes were running on the nodes, contributing to the overall high memory usage:
    [3219759] lsof rss=744993 pages
    [3227913] lsof rss=735352 pages
    [3228850] lsof rss=735394 pages

  3. On further checking, I observed a stark difference in the number of file descriptors between the same 4.8.1 running on the older Ubuntu nodes and on the new Ubuntu 24 nodes, with the count reaching 300K on the newer nodes (a query to track this per node is sketched below). I found the following ticket, related to the cgroup v2 resource manager and also containing the fix: File descriptor leaks when accessing `ExtendedOsStats` columns from synthetic `sys.nodes` table · Issue #13027 · crate/crate · GitHub.
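
For anyone running into the same issue, the per-node file descriptor count can also be watched from within CrateDB. A sketch; note that on an affected 4.8.1 node, queries touching the leaking sys.nodes columns may themselves contribute to the problem, so we keep this infrequent:

```sql
-- Open vs. maximum file descriptors per node, highest consumers first.
SELECT name,
       process['open_file_descriptors'] AS open_fds,
       process['max_open_file_descriptors'] AS max_fds
FROM sys.nodes
ORDER BY 2 DESC;
```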

Current understanding: Due to the ever-growing number of FDs, the lsof processes consume a lot of memory and CPU going through them, and the overall system memory gets choked.

Next Steps:

  1. The next step we have planned is deploying CrateDB 5.10.x and monitoring the cluster. If the issue mentioned above is the root cause, this should solve the problem. As our existing data is on CrateDB 4.8.1, we are planning a +1 major version upgrade, for which we will need COPY commands to move the data across clusters (a sketch of the export/import flow is below).
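
The export/import we have in mind would look roughly like this; paths and table names are placeholders, and the real run would cover every table plus the matching CREATE TABLE statements on the new cluster:

```sql
-- On the existing 4.8.1 cluster: export the table to storage reachable by the new cluster.
COPY doc.my_table TO DIRECTORY '/mnt/export/my_table';

-- On the new 5.10.x cluster: recreate the table first, then import the exported files.
COPY doc.my_table FROM '/mnt/export/my_table/my_table_*.json'
  WITH (shared = true);
```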

As for JMX metrics, I enabled them recently. After going through the JMX documentation, I could see that the beans are mostly meaningful when a considerable amount of data and queries are being handled by the nodes. But the issue occurs even with very little data (1 table, 2 records) and little to no queries being executed. However, I would be happy to provide the stats if anything is required for further analysis.
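
Even with almost no data, the built-in node checks might still show something worth sharing; a small sketch of what I can post alongside the FD numbers:

```sql
-- Any built-in node checks that are currently failing.
SELECT node_id, id, severity, description
FROM sys.node_checks
WHERE NOT passed;
```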

Please suggest if there are any other ways of analysing or tackling this issue.

Thanks and Regards
Arnab