Selecting (100 out of 5 million) distinct values takes less than a second
SELECT DISTINCT textfield FROM georg.texttest;
However inserting does distinct values in a new table, takes much longer (25 sec) …
-- Create a 2nd table
CREATE TABLE georg.texttestdistinct (textfield TEXT);
-- Insert distinct values into new table
INSERT INTO georg.texttestdistinct (textfield) SELECT distinct textfield from georg.texttest;
With bigger data sets (several 100 million rows), the distinct select is still done in a few seconds, but inserting those 50 values won’t even finish after 10 minutes.
I tried to re-produce this on my mighty local laptop 2018 Model.
a) start crate: cr8 run-crate latest
b) connect to the database: crash --host localhost:4200
c) create table: CREATE TABLE "bigtable" ( "wort" text )
d) generate the records from : mkjson --num 5000000 wort="oneOf(fromFile(fakewords))" | cr8 insert-json --host localhost:4200 --table bigtable
e) back in crash: `select DISTINCT wort from bigtable;
+----------+
| wort |
+----------+
| plane |
| been |
| half |
| don't |
| year |
| govern |
:
:
| mean |
| front |
+----------+
SELECT 499 rows in set (0.285 sec)
cr> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 5000002 |
+----------+
SELECT 1 row in set (0.001 sec)
cr> insert into smalltable (wort) select distinct(wort) from bigtable;
INSERT OK, 499 rows affected (6.480 sec)
I suspect this is because the SELECT case without INSERT is getting optimized, and the later is not.
Could one of you create a Github issue out of this with the reproduction steps? This would make it easier for us to follow up on it without it getting lost.