Hello CrateDB Community,
I’m experiencing circuit breaker errors during aggregation queries (specifically collect operations) and would appreciate guidance on optimization strategies.
Environment:
- CrateDB cluster: 2 nodes
- Replicas: 0
- Daily data volume: ~1 billion records
- Shards per partition: 4
- Concurrent queries: ~460 running simultaneously
- Circuit breaker limit: 18 GB
Error:

```
ERROR: Allocating 33kb for 'collect: 0' failed, breaker would use 18gb in total.
Limit is 18gb. Either increase memory and limit, change the query or reduce concurrent query load
Where: org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:99)
org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.limit(ChildMemoryCircuitBreaker.java:179)
org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:124)
io.crate.breaker.ConcurrentRamAccounting.lambda$forCircuitBreaker$0(ConcurrentRamAccounting.java:50)
```
Context:
- The error occurs during `COLLECT_SET` aggregation operations
- The breaker is already near its 18 GB limit from concurrent operations
- Multiple aggregation queries run concurrently on the same dataset
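For reference, here is how I have been checking heap usage and how I understand the breaker limit can be raised at runtime. This is a sketch, assuming the query circuit breaker (`indices.breaker.query.limit`) is the one tripping; the 70% value is just an example, not a recommendation:

```sql
-- Inspect per-node heap usage to see how close we run to the limit
SELECT name, heap['used'], heap['max'] FROM sys.nodes;

-- Raise the query circuit breaker limit at runtime
-- (TRANSIENT: reverts on node restart; use PERSISTENT to keep it)
SET GLOBAL TRANSIENT "indices.breaker.query.limit" = '70%';
```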
Questions:
- Memory Configuration: With 2 nodes handling ~1B records/day and ~460 concurrent queries, is an 18 GB circuit breaker limit appropriate? What's the recommended sizing?
- Query Optimization: What are best practices for optimizing `COLLECT_SET` aggregations on high-cardinality data to reduce memory pressure?
- Concurrent Load Management: Should I:
  - Implement query throttling/batching?
  - Increase the circuit breaker limit (and by how much)?
  - Add more nodes to distribute the load?
- Shard Configuration: Is 4 shards per partition optimal for this volume, or should I adjust based on concurrent query patterns?
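For context on the shard question, the tables are created roughly like this (illustrative only; the table and column names here are placeholders, not my actual schema):

```sql
-- Placeholder schema: 4 shards per partition, no replicas, partitioned by day
CREATE TABLE metrics (
    day TIMESTAMP WITH TIME ZONE,
    payload OBJECT
) CLUSTERED INTO 4 SHARDS
  PARTITIONED BY (day)
  WITH (number_of_replicas = 0);
```

My understanding is that the shard count can also be changed for future partitions with `ALTER TABLE ... SET`, without touching existing partitions, so I could adjust this going forward if a different count is recommended.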
Any recommendations on balancing memory limits and concurrent aggregation workload at this scale would be appreciated.
Thank you!