Hi @djbestenergy
A little more context with a few more tests (inserting 1 million records into different table schemas):
General note: for size comparisons it is good practice to limit the shards to 1 and run an OPTIMIZE to fully merge the segments.
-- Summary of the table schemas under test (full, runnable DDL is in test_size.sql below):
--   *_io    -> INDEX OFF (no inverted index on the column)
--   *_io_co -> INDEX OFF and column store disabled
create table float(num FLOAT)
create table float_io(num FLOAT INDEX OFF)
create table double(num DOUBLE)
create table double_io(num DOUBLE INDEX OFF)
create table text_float(num TEXT)
create table text_float_io(num TEXT INDEX OFF)
create table text_float_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE))
create table text_double(num TEXT)
create table text_double_io(num TEXT INDEX OFF)
create table text_double_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE))
test_size.sql
-- test_size.sql
-- Compares on-disk segment size for storing floating point values as
-- FLOAT/DOUBLE vs TEXT, with the inverted index and column store on/off.
-- Methodology: 1 shard, 0 replicas, and OPTIMIZE down to a single segment
-- per table so the reported sizes are directly comparable.
set search_path to 'size_test';

-- Start from a clean slate so repeated runs produce comparable numbers.
drop table if exists float;
drop table if exists double;
drop table if exists float_io;
drop table if exists double_io;
drop table if exists text_float;
drop table if exists text_float_io;
drop table if exists text_float_io_co;
drop table if exists text_double;
drop table if exists text_double_io;
drop table if exists text_double_io_co;

-- One shard and no replicas: each table ends up as exactly one primary
-- shard, which OPTIMIZE below can merge into a single segment.
-- Suffixes: _io = INDEX OFF, _co = column store off.
create table float(num FLOAT) clustered into 1 shards with ("number_of_replicas" = 0);
create table float_io(num FLOAT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table double(num DOUBLE) clustered into 1 shards with ("number_of_replicas" = 0);
create table double_io(num DOUBLE INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float(num TEXT) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float_io(num TEXT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_float_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double(num TEXT) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double_io(num TEXT INDEX OFF) clustered into 1 shards with ("number_of_replicas" = 0);
create table text_double_io_co(num TEXT INDEX OFF STORAGE WITH (columnstore = FALSE)) clustered into 1 shards with ("number_of_replicas" = 0);

-- Seed one table with 1 million random doubles; refresh makes the rows
-- visible to the SELECTs that populate the remaining tables from it, so
-- every table holds the SAME values (only the storage format differs).
insert into double
select random() from generate_series(1, 1000000, 1);
refresh table double;

insert into float select num::FLOAT from double;
insert into float_io select num::FLOAT from double;
insert into double_io select * from double;
insert into text_float select num::FLOAT from double;
insert into text_float_io select num::FLOAT from double;
insert into text_float_io_co select num::FLOAT from double;
insert into text_double select num from double;
insert into text_double_io select num from double;
insert into text_double_io_co select num from double;

-- Make all writes visible before merging.
refresh table float;
refresh table float_io;
refresh table double;
refresh table double_io;
refresh table text_float;
refresh table text_float_io;
refresh table text_float_io_co;
refresh table text_double;
refresh table text_double_io;
refresh table text_double_io_co;

-- Merge each shard down to a single segment so segment sizes reflect the
-- storage format rather than merge state.
optimize table float with (max_num_segments = 1);
optimize table float_io with (max_num_segments = 1);
optimize table double with (max_num_segments = 1);
optimize table double_io with (max_num_segments = 1);
optimize table text_float with (max_num_segments = 1);
optimize table text_float_io with (max_num_segments = 1);
optimize table text_float_io_co with (max_num_segments = 1);
optimize table text_double with (max_num_segments = 1);
optimize table text_double_io with (max_num_segments = 1);
optimize table text_double_io_co with (max_num_segments = 1);

-- Report the per-table segment size in MiB (primary shards only).
select
    table_name,
    round(size / power(1024, 2))::TEXT || ' MiB' as size
from sys.segments
where primary and table_schema = 'size_test'
order by table_name;
The script can be run locally by putting it in a file and executing it with crash:
crash < test_size.sql
Results:
+-------------------+--------+
| table_name | size |
+-------------------+--------+
| double | 49 MiB |
| double_io | 37 MiB |
| text_double | 59 MiB |
| text_double_io | 46 MiB |
| text_double_io_co | 31 MiB |
+-------------------+--------+
| float | 31 MiB |
| float_io | 26 MiB |
| text_float | 37 MiB |
| text_float_io | 29 MiB |
| text_float_io_co | 23 MiB |
+-------------------+--------+
-- io -> index off
-- co -> column store off
i.e. the storage savings from storing floating point values as text is only ~15%, but it comes with a significant performance impact, especially for aggregations:
echo "SELECT avg(num), count(num) from size_test.float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
mean: 30.928 ± 0.148
min/max: 30.006 → 34.941
% echo "SELECT avg(num), count(num) from size_test.float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
mean: 31.006 ± 0.150
min/max: 29.812 → 33.890
% echo "SELECT avg(num::float), count(num) from size_test.text_float;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
mean: 526.880 ± 8.488
min/max: 475.093 → 715.010
% echo "SELECT avg(num::float), count(num) from size_test.text_float_io;" | cr8 timeit --hosts localhost:4200
Runtime (in ms):
mean: 554.985 ± 12.245
min/max: 474.457 → 789.776
% echo "SELECT avg(num::float), count(num) from size_test.text_float_io_co;" | cr8 timeit --hosts localhost:4200;
Runtime (in ms):
mean: 9963.003 ± 101.061
min/max: 9122.198 → 11976.021
float_io
vs text_float_io_co
→ 321x
That being said, this test was done with completely random data. Encoding/compression might work differently with real-world data, which typically shows less randomness.