IN clause on primary key column does not return all the records if it has more than 5k inputs

Query e.g:
SELECT * FROM table1 WHERE t1 = ‘abc’ AND id IN (551, 102, 1053, 4, 603, … n) AND t2 = 12;

Note: Above query has more than 5000 elements in the IN clause.
Above query does not return all the records.
id is my primary key column.

Observation A:
If I add 1 more IN clause to the same query then it returns all the expected records.
e.g:

  1. SELECT * FROM table1 WHERE t1 = ‘abc’ AND ( id IN (551, 102, 1053, 4, 603, … n) or id IN (551) ) AND t2 = 12;
  2. SELECT * FROM table1 WHERE t1 = ‘abc’ AND id IN (551, 102, 1053, 4, 603, … n) AND t2 in(12);

Observation B:
As per EXPLAIN query result

  1. When a query has single IN clause then it uses Dockeys
  2. When a query has multiple IN clauses then it does not use Dockeys and runs the query as it is.

Which query doesn’t use Dockeys? (I think it’s the one with or, which is expected behaviour)

When you query by primary key and all other expressions are joined with AND operator, it should use Dockeys. This means that the query will also get rows which may not be available in the table until the next REFRESH takes place.

Furthermore, what do you use to query the database? In admin UI there is an implicit LIMIT that is added. Have you tried to add a LIMIT 10000 for example at the end of your query? Alternatively you can also use a SELECT count(*) to compare the number of records returned by each of the queries you’ve been trying.

I could reproduce this problem with below steps. This could be a bug in Crate 5.8.2.
Step 1: Create below table.

create table uu.records (
id long,
uid string,
recordid long,
  rday TIMESTAMP WITH TIME ZONE,
recordname string,
recorddesc string,
primary key (rday,recordid,id )
) PARTITIONED BY (rday)
WITH (
   "translog.durability" = 'ASYNC'
);

Step 2: Load the data from attached files using below script:
copy uu.records from '/tmp/sample_records/*.json';

Step 3: Once data is imported in the table, run the queries given in attached files. Difference between two queries is additional or id in(156501)

My expectation was that I should get same data output but I see less data in one query.

Can someone confirm this as a bug in Crate 5.8.2? IMO it should also reproduce on older versions.

records_0_04732dpi6spjed9k60o30c1g.json (55.3 KB)
records_1_04732dpi6spjed9k60o30c1g.json (51.8 KB)
records_2_04732dpi6spjed9k60o30c1g.json (52.6 KB)
records_3_04732dpi6spjed9k60o30c1g.json (55.6 KB)

Query_returning_incorrect_results.txt (17.6 KB)
Query_returning_correct_results.txt (17.6 KB)

thx for reporting this issue @vinayak.shukre . It has been addressed, and the fix will be available with the upcoming 5.9.3 hotfix release.