I’ve been using CrateDB for a couple of months now. We have a Kafka cluster where IoT data is queued in a topic. We then use Erlang as a stream processor to transform these messages into INSERT statements, which we post to the Crate HTTP endpoint, where three CrateDB 4.x nodes do their work. All of this is deployed on Azure Kubernetes Service. The Crate nodes each have resource limits of 2 CPUs and 3 GB RAM, with a 2 GB heap. I’m aware that these resources might be too little for our use case, but under normal load all of this works pretty well.
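For context, the statements we post are basically simple inserts along these lines (a simplified sketch; the table and column names here are made up, our real schema is more involved):

```sql
-- Simplified sketch of what the Erlang processor produces per message;
-- table and column names are invented for illustration:
INSERT INTO iot_measurements (device_id, ts, temperature)
VALUES ('device-0042', '2021-06-01T12:00:00+00:00', 21.4);
```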
We are planning to move to a much bigger cluster, so Crate will get more resources, but first I want to make sure I have a basic understanding of what is going wrong when there is a load peak.
When our Kafka consumers build up a lot of lag and the Erlang processes are posting to Crate continuously
(at a rate of approx. 200 INSERTs per second), the whole system gets really slow and unstable.
The load in the Crate UI doesn’t really hit the limits (processor load 20 to 50%, heap at most 70%). Nevertheless, every now and then Crate goes into recovery: suddenly a lot of things turn red and the number of rows is 0 (which makes me REALLY nervous). After a while it usually seems to recover, but there were also moments where it stayed frozen in this state and I had to RESTORE from a day-old SNAPSHOT.
I also tried to analyse sys.jobs, where there are up to 15,000 (!) jobs during these high-load times, but not many related entries in sys.operations, so I think those are zombie jobs.
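This is roughly how I look at the system tables when things pile up (column names taken from the CrateDB 4.x docs, as far as I understand them):

```sql
-- How many jobs are currently registered:
SELECT count(*) FROM sys.jobs;

-- The oldest still-active jobs and their statements:
SELECT started, stmt FROM sys.jobs ORDER BY started ASC LIMIT 20;

-- Operations belonging to active jobs, to see whether a job is
-- actually doing work or just lingering:
SELECT o.job_id, o.name, o.used_bytes, j.stmt
FROM sys.operations o
JOIN sys.jobs j ON o.job_id = j.id
ORDER BY o.used_bytes DESC
LIMIT 20;
```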
Long story short: I feel very unsure about how best to tackle these cases. Should I just throw more resources at my cluster, or am I missing some basic tweaks? The most important thing for me isn’t to make Crate handle the most load imaginable; I just don’t want to lose data that is already in my database, and those unstable recovery processes feel like gambling to me.