How big and how fast can the CrateDB In-Memory Buffer be (as a high-speed logger)?

Hi,

I read the tutorial [Fundamentals of eventual consistency in CrateDB], which describes the ‘Ingestion workflow’ architecture.

Wanting to avoid putting a queue manager in front of CrateDB, I would like to know whether the In-Memory Buffer can be used as a queue.

How many write requests can CrateDB receive and handle without coming under stress?

Assume I have a very simple table named ‘log’ consisting of two columns, ‘ts’ as TIMESTAMP and ‘pl’ as a dynamic OBJECT, and assume I want to write JSON data weighing about 50 KB per message into ‘pl’. How can I calculate how many write requests per second CrateDB is able to handle?
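
For reference, a minimal sketch of the table I have in mind (CrateDB SQL; the column names are as above, everything else is illustrative):

```sql
-- Simple log table: a timestamp plus a dynamic JSON payload (~50 KB per message).
CREATE TABLE log (
    ts TIMESTAMP,
    pl OBJECT(DYNAMIC)
);
```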

Let’s assume that CrateDB runs on a VM with 16 amd64 cores, 32 GB RAM, and a Linux OS, and receives 5,000 incoming write requests per second (at 50 KB per message, roughly 250 MB/s of raw payload). Is this a feasible scenario with CrateDB (single instance)?

Is it better to have them sent by a single client or multiple clients?

How many clients would CrateDB be able to handle?

How does the In-Memory Buffer work? Is it manageable at the configuration level?
Is it possible to define a maximum number of queued items and how much RAM to allocate to the buffer?

Thank you.

Wanting to avoid putting a queue manager in front of CrateDB, I would like to know whether the In-Memory Buffer can be used as a queue.

I don’t know what your source of data is, but this could probably be done without a third-party queue and with rather simple application logic.

How many write requests can CrateDB receive and handle without coming under stress?

Difficult to answer, as it depends on many factors; typically this is something one would need to evaluate with a simulated production load. That being said, single INSERT statements with just one record each are very inefficient, and I would strongly recommend batching them together.
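
As a hedged sketch of what batching could look like, assuming the ‘log’ table from above (the values are placeholders):

```sql
-- One multi-row INSERT instead of many single-record statements.
INSERT INTO log (ts, pl) VALUES
    ('2024-01-01T00:00:00Z', {level = 'info',  msg = 'service started'}),
    ('2024-01-01T00:00:01Z', {level = 'warn',  msg = 'slow response'}),
    ('2024-01-01T00:00:02Z', {level = 'error', msg = 'timeout'});
```

A client driver’s bulk/batch API achieves the same effect if you would rather batch on the application side.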

I want to write JSON data weighing about 50 KB per message

Here it is more a question of how many sub-columns we are talking about, i.e. how many of them get indexed.
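
The column policy of the OBJECT column controls this; a hedged sketch, assuming you do not need to filter on individual sub-columns:

```sql
-- DYNAMIC (the default): every new sub-column of 'pl' is added to the
-- schema and indexed, which costs indexing time for large payloads.
CREATE TABLE log_dynamic (ts TIMESTAMP, pl OBJECT(DYNAMIC));

-- IGNORED: the payload is stored as-is, but sub-columns are not indexed
-- individually; filtering on them falls back to scanning.
CREATE TABLE log_ignored (ts TIMESTAMP, pl OBJECT(IGNORED));
```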

Is it better to have them sent by a single client or multiple clients?
How many clients would CrateDB be able to handle?

It doesn’t really matter; CrateDB can handle many connections in parallel without issue. Of course, you would want to reuse / keep connections open to avoid connection overhead.

How does the In-Memory Buffer work? Is it manageable at the configuration level?

Be aware that new data is always persisted in the WAL (translog) on disk. The in-memory buffer is mostly there to collect writes until a table refresh creates new Lucene segments and makes the data queryable; updating the index / creating a Lucene segment for every single new value would be too expensive. While there are various settings to adjust this, I would advise refraining from changing them in the beginning.
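
If you do want to experiment later, a sketch of the table-level knob (the 5000 ms value is only an example; the default is 1000 ms):

```sql
-- How often buffered writes are turned into searchable Lucene segments.
ALTER TABLE log SET ("refresh_interval" = 5000);

-- Force an immediate refresh, e.g. in tests.
REFRESH TABLE log;
```

The size of the indexing buffer itself is a node-level setting (`indices.memory.index_buffer_size` in the CrateDB configuration), not something defined per table, and there is no user-facing maximum number of queued items.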