Bad insert performance (50 values) on SELECT FROM

proddata · June 12, 2020, 2:39pm

I am not sure if this is a bug or another problem …

CREATE TABLE georg.texttest (textfield TEXT);

-- INSERT 5 million textvalues (100 distinct);
INSERT INTO georg.texttest ("textfield") values
('VALUE00'),('VALUE01'), ... , ... ,('VALUE98'),('VALUE99');

Selecting (100 out of 5 million) distinct values takes less than a second

SELECT DISTINCT textfield FROM georg.texttest;

However inserting does distinct values in a new table, takes much longer (25 sec) …

-- Create a 2nd table
CREATE TABLE georg.texttestdistinct (textfield TEXT);

-- Insert distinct values into new table
INSERT INTO georg.texttestdistinct (textfield) SELECT distinct textfield from georg.texttest;

With bigger data sets (several 100 million rows), the distinct select is still done in a few seconds, but inserting those 50 values won’t even finish after 10 minutes.

Walter_Behmann · June 13, 2020, 9:11am

I tried to re-produce this on my mighty local laptop 2018 Model.
a) start crate: cr8 run-crate latest
b) connect to the database: crash --host localhost:4200
c) create table: CREATE TABLE "bigtable" ( "wort" text )
d) generate the records from : mkjson --num 5000000 wort="oneOf(fromFile(fakewords))" | cr8 insert-json --host localhost:4200 --table bigtable
e) back in crash: `select DISTINCT wort from bigtable;

+----------+
| wort     |
+----------+
| plane    |
| been     |
| half     |
| don't    |
| year     |
| govern   |
:
:
| mean     |
| front    |
+----------+
SELECT 499 rows in set (0.285 sec)

cr> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
|  5000002 |
+----------+
SELECT 1 row in set (0.001 sec)
cr> insert into smalltable (wort) select distinct(wort) from bigtable;
INSERT OK, 499 rows affected  (6.480 sec)

Your mileage may vary!

Walter_Behmann · June 14, 2020, 5:08am

So eventually there might be some insight about:

how are indexes used in a ´select distinct`?
select distinct takes 0.285 sec. and the insert takes 6.480 sec., that seems like a lot of heavy lifting going on during the insert.

cratedb 4.1.6 was used for the tests.

mfussenegger · June 15, 2020, 3:14pm

I suspect this is because the SELECT case without INSERT is getting optimized, and the later is not.

Could one of you create a Github issue out of this with the reproduction steps? This would make it easier for us to follow up on it without it getting lost.

proddata · June 15, 2020, 6:08pm

Issue created https://github.com/crate/crate/issues/10082

Topic		Replies	Views
Java Driver Issue 3rd Party Tools	1	860	December 5, 2018
Insert an array of json objects ERROR SQL	6	1539	September 23, 2021
Performance issue CrateDB	6	96	July 22, 2024
Query execution on time distinct values, does it need a full scan? CrateDB	0	544	May 21, 2020
Poor Performance: Snapshot creation takes over an hour for a 5-row table CrateDB sql	0	19	March 27, 2025

Bad insert performance (50 values) on SELECT FROM

Related topics