Circuit Breaker Error on COLLECT_SET Aggregation - "breaker would use 18gb in total. Limit is 18gb"

Hello CrateDB Community,

I’m experiencing circuit breaker errors during aggregation queries (specifically collect operations) and would appreciate guidance on optimization strategies.


Environment:

  • CrateDB cluster: 2 nodes

  • Replicas: 0

  • Daily data volume: ~1 billion records

  • Shards per partition: 4

  • Concurrent queries: ~460 running simultaneously

  • Circuit breaker limit: 18GB

Error:

```
ERROR: Allocating 33kb for 'collect: 0' failed, breaker would use 18gb in total.
Limit is 18gb. Either increase memory and limit, change the query or reduce concurrent query load

Where: org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:99)
org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.limit(ChildMemoryCircuitBreaker.java:179)
org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:124)
io.crate.breaker.ConcurrentRamAccounting.lambda$forCircuitBreaker$0(ConcurrentRamAccounting.java:50)
```


Context:

  • Error occurs during COLLECT_SET aggregation operations

  • The breaker is already near its 18GB limit from concurrent operations

  • Multiple aggregation queries run concurrently on the same dataset
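For reference, the failing queries are of roughly this shape (a simplified sketch; the table and column names below are placeholders, not the actual schema):

```sql
-- Simplified placeholder query: COLLECT_SET over a high-cardinality
-- column, grouped across a day's partition (~1B rows/day).
SELECT device_id,
       COLLECT_SET(event_type) AS event_types
FROM events
WHERE event_day = CURRENT_DATE
GROUP BY device_id;
```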

Questions:

  1. Memory Configuration: With 2 nodes handling ~1B records/day and ~460 concurrent queries, is an 18 GB circuit breaker limit appropriate? What’s the recommended sizing?

  2. Query Optimization: What are best practices for optimizing COLLECT_SET aggregations on high-cardinality data to reduce memory pressure?

  3. Concurrent Load Management: Should I:

    • Implement query throttling/batching?

    • Increase the circuit breaker limit (and by how much)?

    • Add more nodes to distribute the load?

  4. Shard Configuration: Is 4 shards per partition optimal for this volume, or should I adjust based on concurrent query patterns?

Any recommendations on balancing memory limits and concurrent aggregation workload at this scale would be appreciated.

Thank you!

Hi @varad_Deshmukh, welcome to our community!

Thanks for outlining your scenario in thorough detail.

> 1. Memory Configuration: With 2 nodes handling ~1B records/day and ~460 concurrent queries, is 18GB circuit breaker limit appropriate? What’s the recommended sizing?
> 2. Query Optimization: What are best practices for optimizing COLLECT_SET aggregations on high-cardinality data to reduce memory pressure?

The numbers you shared work out to an average circuit breaker usage of ~40 MB per query (18 GB ÷ 460 concurrent queries). Whether that is high or low depends largely on how many rows the queries read from disk and on how complex the queries are overall.

To get a better feeling for this, could you please also share:

  1. A typical SELECT query and its EXPLAIN output
  2. Details on the shard sizes: `SELECT partition_ident, id, num_docs, size / POWER(1024, 3) AS size_gb FROM sys.shards WHERE table_name = '<your table>' ORDER BY 1, 2;`
  3. There are multiple types of circuit breakers; I assume you are running into the query circuit breaker here. If you are running with default values, the query circuit breaker uses 60% of the total heap size, which would mean your heap is configured at 30 GB. Is that correct? How much memory do your machines have in total?
  4. What is the CrateDB version you are using?
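To double-check point 3, the configured heap maximum per node can be read from `sys.nodes` (a quick sketch; with default settings, the query circuit breaker limit is 60% of `heap['max']`):

```sql
-- Per-node heap configuration and current usage
SELECT name,
       heap['max'] / POWER(1024, 3) AS heap_max_gb,
       heap['used'] / POWER(1024, 3) AS heap_used_gb
FROM sys.nodes;
```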

To get more insights into how much memory your queries consume, you can also check sys.operations, specifically the used_bytes column. See this example in our documentation. Does that confirm a ~40 MB usage per query? We also need to keep in mind that INSERT statements consume heap space (and therefore circuit breaker budget) as well, so not all of the heap is available for SELECT queries.
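For example, something along these lines shows the biggest memory consumers at the moment it is run (a sketch; adjust the LIMIT as needed):

```sql
-- Currently running operations, largest memory consumers first
SELECT node['name'] AS node_name,
       name AS operation,
       used_bytes / 1024.0 / 1024.0 AS used_mb
FROM sys.operations
ORDER BY used_bytes DESC
LIMIT 20;
```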

Best
Niklas