Hi
Does anyone know, when using the HTTP endpoint, whether it is more efficient for a 3-node CrateDB cluster to accept larger bulk inserts infrequently, or smaller bulk inserts more often?
For instance, I might insert 242 rows totalling 326 KB, compared to 30 rows at 3.5 KB.
I just want to see if I can fine-tune our ingestion to work better with the CrateDB cluster.
In general, we would recommend testing with different batch sizes to check the performance for your specific workload. Sometimes you can even make this adaptive by using your cluster's monitoring data to increase or decrease the batch size according to the resources available.
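As a rough illustration of that adaptive idea, here is a minimal sketch; the thresholds, bounds, and timings are invented for illustration and would need tuning against your actual monitoring data:

```python
# Hypothetical adaptive batching: shrink the batch when inserts take
# too long, grow it while the cluster keeps up comfortably.
MIN_SIZE, MAX_SIZE = 50, 5_000   # invented bounds, not recommendations

def next_batch_size(current_size: int, elapsed_seconds: float) -> int:
    if elapsed_seconds > 1.0:    # last insert was slow: back off
        return max(MIN_SIZE, current_size // 2)
    if elapsed_seconds < 0.2:    # plenty of headroom: push more rows
        return min(MAX_SIZE, current_size * 2)
    return current_size          # within the comfort zone: keep as-is
```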
For your specific case, we don’t see a meaningful difference between 30 rows and 242 rows, since both are still pretty much in the same order of magnitude. In general, we would advise avoiding many small batches to prevent the repeated cost of query parsing, planning, and so on.
Depending on your use case, it is understandable to have such small batches, for instance to approximate a continuous data stream by batching data that arrives within a one-second interval, or simply to keep the computational cost per insert low (which would be the case with both 30 and 242 rows).
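For reference, a bulk insert against the HTTP endpoint sends one statement plus many parameter sets, so parsing and planning happen once per batch rather than once per row. A minimal sketch (the table and columns are made up for illustration):

```python
import requests

CRATE_URL = "http://localhost:4200/_sql"  # CrateDB HTTP endpoint, default port
STMT = "INSERT INTO readings (ts, device, value) VALUES (?, ?, ?)"  # example table

def insert_batch(batch):
    # One HTTP round trip; bulk_args carries one parameter list per row.
    resp = requests.post(CRATE_URL, json={"stmt": STMT, "bulk_args": batch})
    resp.raise_for_status()
    return resp.json()["results"]  # one rowcount entry per parameter set

# e.g. insert_batch([[1700000000000, "device-1", 0.42],
#                    [1700000000100, "device-2", 0.37]])
```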
Have you observed a big performance difference between the batch sizes? Feel free to share further details so we can work on a more tailored answer.
Thank you for writing in. I’d like to second what Karyn said.
Looking at those particular numbers, I think you should experiment with much larger batch sizes, of course also depending on the “width” of the records in your dataset.
Quoting a particular passage from the document Karyn referenced conveys the gist in this regard:
I think either 30 or 242 records per batch is too small a number to make any significant difference.
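If it helps, something along these lines could compare sizes empirically, reusing the `insert_batch` helper sketched above (the batch sizes and the generated rows are arbitrary placeholders):

```python
import time

# Arbitrary synthetic rows matching the example statement above.
rows = [[1_700_000_000_000 + i, f"device-{i % 10}", i * 0.1]
        for i in range(100_000)]

for size in (30, 242, 1_000, 5_000, 20_000):
    start = time.monotonic()
    for i in range(0, len(rows), size):
        insert_batch(rows[i:i + size])
    rate = len(rows) / (time.monotonic() - start)
    print(f"batch size {size:>6}: {rate:,.0f} rows/s")
```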