Problems under high INSERT load

I’ve been using CrateDB for a couple of months now. We have a Kafka cluster where IoT data is queued in a topic. We then use Erlang as a stream processor to transform these records into INSERT statements, which we post to the CrateDB HTTP endpoint, where 3 CrateDB 4.x nodes do their work. All of this is deployed on Azure Kubernetes Service. The CrateDB nodes each have resource limits of 2 CPUs, 3 GB RAM, and a 2 GB heap. I’m aware that these resources might be too limited for our use case, but under normal load all of this works pretty well.

We are planning to move to a much bigger cluster, so CrateDB will get more resources, but first I want to make sure that I have a basic understanding of what is going wrong when there is a load peak.

When our Kafka consumers have a lot of lag and the Erlang processes are posting to CrateDB continuously (at a rate of approx. 200 INSERTs per second), the whole system gets really slow and unstable.

The load in the CrateDB UI doesn’t really hit the limits (processor load 20 to 50%, heap max. 70%). Nevertheless, every now and then CrateDB goes into recovery, suddenly a lot of things turn red, and the number of rows is 0 (which makes me REALLY nervous)… after a while it seems to recover, but there were also moments where it was frozen in this state and I had to RESTORE from a day-old SNAPSHOT.

I also tried to analyse why there are up to 15,000 (!) jobs during these high-load times, but there are not many related sys.processes entries, so I think those are zombies.

Long story short: I feel very unsure about how to tackle these cases best. Should I just throw more resources at my cluster, or am I missing some basic tweaks? The most important thing for me isn’t making CrateDB handle the most load imaginable; I just don’t want to lose data that is already in my database, and those unstable recovery processes look like gambling to me.

Hi Jürgen,

To be fair, the resources are quite limited. Then again, 200 INSERTs/s isn’t that much either.

  • Do you have more information on the 15k jobs that you saw in the high-load times?
  • Could you maybe also share log data from the mentioned events?
  • Have you set bootstrap.memory_lock to true?


I have to add that I do not directly maintain the INSERT statements, as we are adopting FiWARE and using their “QuantumLeap” (time-series persistence API), which uses CrateDB as a backend. AFAIK, they really do one INSERT per dataset (over the Python client). We face MUCH fewer problems when I build the INSERT statement in Erlang on my own and use bulk_args to INSERT a couple of datasets at once over the HTTP API. Unfortunately, doing it the FiWARE way is a must in this project.
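For reference, the bulk_args approach mentioned above can be sketched in Python like this (a hedged sketch: the table name `iot_readings` and its columns are made-up placeholders, and the actual HTTP POST is only indicated in a comment so the snippet stays self-contained):

```python
import json

def build_bulk_payload(rows):
    """Batch several single-row INSERTs into one bulk request body for
    CrateDB's HTTP endpoint. rows: list of (entity_id, ts, value) tuples.
    Table/column names here are illustrative assumptions."""
    return {
        "stmt": "INSERT INTO iot_readings (entity_id, ts, value) VALUES (?, ?, ?)",
        "bulk_args": [list(r) for r in rows],
    }

payload = build_bulk_payload([
    ("dev-1", 1600000000000, 21.5),
    ("dev-2", 1600000001000, 22.0),
])
body = json.dumps(payload)
# POST `body` to http://<crate-node>:4200/_sql with
# Content-Type: application/json (e.g. via urllib.request or requests).
print(len(payload["bulk_args"]))
```

One request carrying many parameter rows means one job in CrateDB instead of hundreds, which is exactly why the hand-rolled Erlang variant behaves so much better.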

So the 15k jobs are simply (asynchronous?) INSERTs into one of two destination tables. Those INSERTs accumulate over time when I start our heavy load test. Without knowing how it works in the background, I assume Erlang is feeding the pipeline so fast that it produces a lot of backpressure in CrateDB, which switches to recovery mode when a certain number of jobs is reached. Perhaps implementing some kind of throttling would help here.
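The throttling idea above can be sketched with a simple bounded-concurrency wrapper (a minimal sketch, not FiWARE- or CrateDB-specific: `post_fn` stands in for whatever actually performs the HTTP INSERT, and the limit of 8 in-flight batches is an arbitrary assumption):

```python
import threading

# Cap the number of in-flight INSERT batches so the producer blocks
# instead of piling up thousands of pending jobs on the database side.
MAX_IN_FLIGHT = 8
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def send_batch(batch, post_fn):
    """Run post_fn(batch), but block first if MAX_IN_FLIGHT batches
    are already pending. post_fn is an assumed callable that does the
    actual HTTP request and returns its result."""
    slots.acquire()
    try:
        return post_fn(batch)
    finally:
        slots.release()

# Example: a dummy post_fn that just reports the batch size.
result = send_batch([1, 2, 3], lambda batch: len(batch))
print(result)
```

The key design point is that backpressure is moved to the client: once the limit is reached, the producer waits rather than the database accumulating jobs.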

Which kind of logs? The Pod/StatefulSet logs, or certain sys.jobs_log entries?
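If sys.jobs_log is what’s wanted, a query along these lines could narrow things down (a sketch; the exact set of columns available depends on the CrateDB version):

```sql
SELECT started, ended, error, stmt
FROM sys.jobs_log
WHERE error IS NOT NULL
ORDER BY ended DESC
LIMIT 50;
```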

No, I assume I have to set bootstrap.memory_lock as a parameter at startup time?
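If I understand it correctly, that would be a line in crate.yml (or an equivalent -C flag on startup), roughly like this sketch:

```yaml
# crate.yml: lock the heap into RAM so it cannot be swapped out.
# Alternatively passed at startup as: -Cbootstrap.memory_lock=true
bootstrap.memory_lock: true
```

On Kubernetes the container presumably also needs permission for an unlimited memlock ulimit, otherwise the lock attempt would fail; I’d have to check how our runtime handles that.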