Oh my fucking god! After all that effort, I finally managed to track down the problem. It’s not CrateDB, it’s not my software—it’s the damn load balancer in between (relayD)!!!
The relayD was configured for load balancing, and as far as I can tell it was doing this at the TCP level rather than per HTTP request. So HTTP requests to the CrateDB cluster were getting split up and didn't all reach the same cluster node, and that's how all these bizarre issues arose. Facepalm.
I’ve now extended my client with connection pooling and implemented the distribution across the cluster nodes myself, on the client side (random, round-robin, or least-loaded). Now I have zero problems! The question I’m left with is: should I maybe spin up an additional CrateDB node that only handles connections and doesn’t store any data, as a dedicated entry point into the cluster? Would that make things faster and utilize the cluster more efficiently, or is my client-side pooling the better approach?
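For reference, here is a stripped-down sketch of the round-robin variant of my client-side pooling (simplified for this post; my real client also does health checks and retries, and the node addresses here are placeholders):

```python
import itertools
import threading

import requests  # plain HTTP against CrateDB's /_sql endpoint


class RoundRobinPool:
    """Tiny round-robin distributor for CrateDB's HTTP endpoint.

    Each statement goes to the next node in the list, so no load
    balancer is needed between the client and the cluster.
    """

    def __init__(self, nodes):
        # e.g. ["http://10.0.0.1:4200", "http://10.0.0.2:4200"]
        self._nodes = itertools.cycle(nodes)
        self._lock = threading.Lock()  # itertools.cycle is not thread-safe
        self._session = requests.Session()  # keep-alive = connection pooling

    def execute(self, stmt, args=None):
        with self._lock:
            node = next(self._nodes)
        # CrateDB's HTTP API: POST /_sql with {"stmt": ..., "args": ...}
        resp = self._session.post(
            f"{node}/_sql",
            json={"stmt": stmt, "args": args or []},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()


pool = RoundRobinPool([
    "http://10.0.0.1:4200",  # placeholder addresses
    "http://10.0.0.2:4200",
    "http://10.0.0.3:4200",
])
print(pool.execute("SELECT name FROM sys.cluster"))
```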
I really want to express my heartfelt gratitude to everyone involved in this topic. The fault was entirely on my side, and all I received from the community was constructive and friendly support! Thank you so much!!!
Thanks for your report. It’s good to have you back with a working system, and that you were able to identify the root cause of your data soup.
Why and how the relayD load balancer messed things up is beyond me, but it’s good that you were able to rule out CrateDB and the other parts of your system.
Excellent.
I don’t think it’s necessary if your system is working well now. However, let me defer this question to @hammerhead or @smu, who can give a more in-depth answer.
There is the option to add dedicated query-handler nodes. These nodes handle incoming connections as well as query coordination (parsing, planning, final aggregation, etc.), while the actual shard-level execution still happens on the data nodes.
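Configuration-wise, this is a minimal sketch of what such a node could look like in crate.yml; please verify the exact setting names against the documentation of your CrateDB version:

```yaml
# crate.yml of a dedicated query-handler node (sketch, assuming the
# node.master / node.data settings of your CrateDB version)
node:
  name: query-1        # placeholder name
  master: false        # not eligible for master election
  data: false          # holds no shards; only coordinates queries
```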
Dedicated query-handler nodes can be beneficial in larger or high-traffic clusters where query coordination or connection handling becomes a bottleneck. They let you separate concerns more clearly and scale query handling independently from indexing and shard-level data retrieval. But they also add complexity and require careful sizing: for high availability, for example, you need multiple query-handler nodes so that they don’t become a single point of failure.
Load on the remaining data nodes would consequently drop, as they no longer participate in connection handling and query coordination. However, you will likely end up needing more resources overall if you add dedicated query-handler nodes: since query coordination and data processing can’t be separated perfectly from a sizing perspective, you have to provision headroom on both types of nodes. This typically leads to higher overall hardware usage than a setup where all nodes share both responsibilities.
Unless you’re already seeing clear signs that connection handling or query coordination is a performance bottleneck, I would recommend sticking with your client-side pooling, or switching to a load balancer that distributes traffic at the HTTP request level rather than the TCP level (see the sketch below).
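If you do want a load balancer in front of the cluster again, the important part is that it balances whole HTTP requests. Purely as an illustration (HAProxy is just one example of an HTTP-aware balancer; names and addresses are placeholders), such a setup could look roughly like this:

```
# haproxy.cfg sketch: balance whole HTTP requests, not raw TCP connections
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend crate_http
    bind *:4200
    default_backend crate_nodes

backend crate_nodes
    balance roundrobin
    option httpchk GET /
    server node1 10.0.0.1:4200 check
    server node2 10.0.0.2:4200 check
    server node3 10.0.0.3:4200 check
```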