Thanks for your report. Is the behavior reproducible, i.e. does reissuing both statements without LIMIT clauses once more also yield a long query duration?
The reason for the initial (longer) duration might be some initialization overhead. May I ask what hardware and environment you are running CrateDB on, and what your configuration looks like? Having enough memory available for CrateDB is crucial for reasonable operation.
That the first query took 30 seconds to answer might indicate that the environment CrateDB is running on needs some time to get ready. It might also indicate that the number of configured shards is too high for the relatively small amount of (~100,000?) records, as that would definitely introduce significant overhead.
Hi Andreas, it seems that after re-importing my 100,000 rows into my two different schemas, performance is now the same for both.
However, even after re-importing, SELECT * FROM users (containing 100,001 rows) and SELECT * FROM users LIMIT 1000000 (which returns the same result, because I only have 100,001 rows) still behave differently (tested multiple times):
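To be explicit, the two statements being compared are:

```sql
-- Both return all 100,001 rows, but only the second one runs fast:
SELECT * FROM users;
SELECT * FROM users LIMIT 1000000;
```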
In the CrateDB Web Admin UI console, I can't reproduce this because of the implicit LIMIT it applies (with a LIMIT, I get the same durations as in my CrateDBAdmin .NET client).
So, in conclusion … as a workaround we would systematically run a count() first, and re-issue the query with the count() result in the LIMIT clause …
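Sketched out, the workaround would mean two round trips per query, feeding the first result into the second (the literal 100001 here is just whatever count() returned):

```sql
-- First round trip: get the exact row count
SELECT count(*) FROM users;
-- Second round trip: re-issue the query with that count as the limit,
-- e.g. if count(*) returned 100001:
SELECT * FROM users LIMIT 100001;
```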
Something is speeding up requests with a LIMIT clause, even when the limit exceeds the number of returned rows.
This behaviour is completely hidden in the Web Admin UI console due to its implicit LIMIT clause (but in custom applications … that may not be the case …)
I'm using CrateDB 4.5 on a Windows test machine (Core i7, 16 GB RAM) with nothing else running on it …
But here's a useful piece of info: on the same machine, I switched to CrateDB 4.4.2 and 4.1.4, and I can't reproduce the problem there (with the same tables, data, and config … only one node) …
Can you reproduce that behaviour on version 4.5 on your side?
This is the thing I would like to follow up on and dig deeper.
So, according to your observations, you believe it happened somewhere between 4.4.2 and 4.5.0? There is one other release before 4.5.0, namely 4.4.3. Maybe you can also check that one in order to narrow things down further?
All right, so the observed performance regression on the "with or without LIMIT clause" issue might have slipped in between 4.4.2 and 4.4.3, right?
All right, so the observed behaviour where, on a WHERE clause returning only one record, adding a LIMIT 1 still gains performance, only happened between 4.4.3 and 4.5.0, right?
I have installed version 4.5 on a dedicated server running Debian Stretch. I see the same behaviour with the LIMIT clauses, but not in the WHERE + LIMIT 1 use case (I think we can forget that one, as it behaves differently on Linux (which is OK) and Windows (not OK)).
Another piece of info: my table "users" has a column "id" which is the PRIMARY KEY. The rows were initially imported (from JSON files) in a disordered manner (thanks to multi-threading! :p)
So if I do a SELECT * FROM users ORDER BY id, performance is now equal to the same request with a LIMIT 1000000 (but without the ORDER BY) …
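In other words, these two statements now take the same time, while the plain unordered, unlimited SELECT is the slow outlier:

```sql
-- Both run equally fast; SELECT * FROM users (alone) is the slow one:
SELECT * FROM users ORDER BY id;
SELECT * FROM users LIMIT 1000000;
```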
I can't see the link between the ORDER BY [primary key] and the LIMIT clause (which should be applied after the query produces its results), but the optimizer certainly takes multiple paths (and passes :p) …
Finally … I don't know what to think about all of that.