There is some limit to send the data through ‘in’ clause in Crate query. How to handle this problem when I have millions of records. Whats is the limit to send data in ‘in’ clause ?
for ex. select * from tablename where employeeid in (1,2,3,…,n);
what is the limit for n’th value in query?
knowing a bit more about the context of the query would be useful here. How is n determined, e.g. is it passed dynamically from a client application? Are the queried employee IDs consecutive or are there gaps between 1 and n? If they are consecutive, a range query (
WHERE employeeid BETWEEN x AND y) should be much more performant and easier to handle.
I would always try to avoid significantly large
IN queries, as it increases complexity at various levels (query parsing, runtime performance, large
sys.jobs_log entries, etc).
There is a limit (
indices.query.bool.max_clause_count) for certain array-related operations. Did you run into any specific limit/error message?
Hi @hammerhead ,
Thanks for the help , we are passing values for IN clause dynamically from client application , and observed that size of these values can be pretty huge , we have successfully executed IN clause with upto 10k values (10k*15 chars total) but looking at requirement it may cross few millions , as these are String and random values we can not use BETWEEN operator.
If the user have a millions of record & they use IN clause in their query, then this query could be raise any issue if there is any limit for IN clause
the employee IDs passed from the client application are completely random, e.g. there is no additional attribute that could be used for filtering, like a department ID or such? Not even for an approximate selection?
I would recommend benchmarking
IN clauses with the cardinality of employee IDs that you need and verifying you still get good performance. I’m less concerned that CrateDB can’t quickly obtain the rows, but rather about the sheer textual size of the resulting SQL query. Independent of CrateDB-specifics I think a very large
IN clause adds significant overhead in compiling the query, transmitting it, parsing it, and so on.
It might be an option to partition the query, meaning not running one huge
IN clause but running several queries with each a subset of the
IN clause (depending on the results of your benchmarking).
What type of client application are we talking about? If millions of rows are returned, I assume this is not a directly user-facing application, but some sort of automated further processing? It might even be worth considering to do an approximate selection of rows in CrateDB and delegate parts of the filtering to the client application. Like obtaining all employees from let’s say a whole department/location/etc (even if that returns slightly too many rows) and doing some fine-grained filtering for specific employee IDs in the client application.