Our organization is currently tasked with migrating cluster nodes from an older Ubuntu release to the latest one, using CrateDB version 4.8.1.
Here are the actions performed successfully.
Deployed the cluster on the newer Ubuntu instances. I had to add an additional Java option, ‘-XX:-UseContainerSupport’, as there is a compatibility issue between the bundled JDK version and Ubuntu 24 (due to the cgroup v2 issue).
I was able to copy the older tables to the newly deployed cluster using SNAPSHOT commands and run queries on them (both inserts and reads).
However, the nodes are going down intermittently with CPU utilization reaching 100%, as observed from the AWS metrics. Once this happens, the entire node remains inaccessible until restarted.
This issue was not present with the older cluster. As of now, we are planning to keep the CrateDB version the same. I just wanted to understand whether the configuration specified above is sustainable.
Please let me know the logs and other details required.
Thank you for writing in, and sorry that you are running into trouble when updating CrateDB. I think CrateDB 4.x is EOL, so we don’t support it any longer, but I am sure we can find a way to support you.
Nodes are going down intermittently with CPU utilization reaching 100% as observed from the AWS metrics.
I think this would need further investigation; we can’t tell much from a distance. Would any of the other CrateDB support options help you in any way? Maybe @hammerhead or @karynsaz can get you onboarded on our Jira to check what kind of special services we might be able to provide for your case?
Also, the behaviour of your cluster might sound familiar to them in one way or another, so they may be able to offer suggestions or even recommendations without further ado.
Please let me know if logs and other details required.
Let me also humbly defer this question to my colleagues: they know best how to start a relevant troubleshooting operation. In general, if you can spot anything suspicious in your log files, it can make sense to share it so we can evaluate it from a distance.
To get to the bottom of issues like the ones you describe, it is very helpful to understand the full picture. For example, what type of data model you use (including sharding, partitioning, …), the characteristics of your workload, details of the environment your Ubuntu instances are running on, and more.
Log files can be interesting if they include any abnormal messages, but may not always include relevant details. Other metrics can help to complement the picture, such as JMX monitoring.
Depending on how much information you can share publicly, please elaborate a bit more on how you use CrateDB overall, what your tables look like (CREATE TABLE statements), your data volume, query patterns, node specifications, etc. If you find any traces of problems in log files, monitoring, etc., that would also be interesting. Usually we start investigations by reviewing monitoring metrics. The mentioned JMX metrics are important here, as they give us insights into how CrateDB performs internally (e.g. if there are any exhausted queues, memory pressure, …).
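As a starting point before full JMX monitoring is in place, a rough first look at per-node load and heap usage can be taken from the sys.nodes table over CrateDB's HTTP endpoint. The following is only a minimal sketch under assumptions: the URL is a placeholder for one of your nodes, the helper names are made up for illustration, and the 75% heap threshold is an arbitrary value, not a recommendation.

```python
import json
import urllib.request

# Assumption: placeholder address; CrateDB's HTTP interface listens on port 4200 by default.
CRATE_URL = "http://localhost:4200/_sql"

# Per-node 1-minute load average and JVM heap usage, read from sys.nodes.
NODE_STATS_SQL = (
    "SELECT name, load['1'] AS load_1m, "
    "heap['used'] AS heap_used, heap['max'] AS heap_max "
    "FROM sys.nodes ORDER BY name"
)


def build_request(stmt, url=CRATE_URL):
    """Build a POST request carrying the statement as a JSON body."""
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_as_dicts(response):
    """Turn the {'cols': [...], 'rows': [[...]]} response into a list of dicts."""
    return [dict(zip(response["cols"], row)) for row in response["rows"]]


def heap_pressure(rows, threshold=0.75):
    """Return names of nodes whose heap usage ratio is at or above the threshold.

    The default threshold is an illustrative assumption only.
    """
    flagged = []
    for row in rows:
        if row["heap_max"] and row["heap_used"] / row["heap_max"] >= threshold:
            flagged.append(row["name"])
    return flagged


def main(url=CRATE_URL):
    """Query the cluster and print one line per node (hypothetical helper)."""
    with urllib.request.urlopen(build_request(NODE_STATS_SQL, url), timeout=10) as resp:
        rows = rows_as_dicts(json.load(resp))
    for row in rows:
        print(row["name"], row["load_1m"], row["heap_used"], row["heap_max"])
    print("nodes with high heap usage:", heap_pressure(rows))
```

Calling main() periodically (for example from cron) and keeping the output next to the AWS CPU graphs can help to show whether load or heap usage climbs before a node becomes unresponsive. It does not replace the JMX metrics mentioned above.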
Getting a complete overview is often difficult in public. You can reach out to us through our website for an individual discussion of your use case and ways we can help. This can range from helping to set up full monitoring coverage to a joint effort upgrading your CrateDB version to a supported one.