Space savings on data

djbestenergy · January 26, 2023, 11:42am

Hello all,
We’ve done quite a bit or rationalization on our data by changing the compression and removing indexes and changing fields sizes down, is there any potential space saving that we might have missed ?.

I know its quite subjective topic, but I’m wondering if storing data in arrays might help ?

The schema we have is below, with indexes turned off on columns we can afford to.

CREATE TABLE IF NOT EXISTS "doc"."safe_1526595589" (
   "ts" TIMESTAMP WITH TIME ZONE,
   "meta_8" INTEGER,
   "a12" REAL INDEX OFF,
   "a13" REAL INDEX OFF,
   "f" REAL INDEX OFF,
   "In" DOUBLE PRECISION INDEX OFF,
   "t" REAL,
   "system_6" REAL,
   "system_7" REAL,
   "system_8" REAL,
   "e" DOUBLE PRECISION INDEX OFF,
   "ex" DOUBLE PRECISION INDEX OFF,
   "re" DOUBLE PRECISION INDEX OFF,
   "rex" DOUBLE PRECISION INDEX OFF,
   "ae" DOUBLE PRECISION INDEX OFF,
   "e1" DOUBLE PRECISION INDEX OFF,
   "ex1" DOUBLE PRECISION INDEX OFF,
   "re1" DOUBLE PRECISION INDEX OFF,
   "rex1" DOUBLE PRECISION INDEX OFF,
   "ae1" DOUBLE PRECISION INDEX OFF,
   "e2" DOUBLE PRECISION INDEX OFF,
   "ex2" DOUBLE PRECISION INDEX OFF,
   "re2" DOUBLE PRECISION INDEX OFF,
   "rex2" DOUBLE PRECISION INDEX OFF,
   "ae2" DOUBLE PRECISION INDEX OFF,
   "e3" DOUBLE PRECISION INDEX OFF,
   "ex3" DOUBLE PRECISION INDEX OFF,
   "re3" DOUBLE PRECISION INDEX OFF,
   "rex3" DOUBLE PRECISION INDEX OFF,
   "ae3" DOUBLE PRECISION INDEX OFF,
   "pulse" INTEGER,
   "counter_22" REAL,
   "counter_23" REAL,
   "counter_24" REAL,
   "v" REAL INDEX OFF,
   "u" REAL INDEX OFF,
   "i" REAL INDEX OFF,
   "p" REAL INDEX OFF,
   "q" REAL INDEX OFF,
   "s" REAL INDEX OFF,
   "pf" REAL INDEX OFF,
   "d" REAL INDEX OFF,
   "v1" REAL INDEX OFF,
   "u1" REAL INDEX OFF,
   "i1" REAL INDEX OFF,
   "p1" REAL INDEX OFF,
   "q1" REAL INDEX OFF,
   "s1" REAL INDEX OFF,
   "pf1" REAL INDEX OFF,
   "d1" REAL INDEX OFF,
   "v2" REAL INDEX OFF,
   "u2" REAL INDEX OFF,
   "i2" REAL INDEX OFF,
   "p2" REAL INDEX OFF,
   "q2" REAL INDEX OFF,
   "s2" REAL INDEX OFF,
   "pf2" REAL INDEX OFF,
   "d2" REAL INDEX OFF,
   "v3" REAL INDEX OFF,
   "u3" REAL INDEX OFF,
   "i3" REAL INDEX OFF,
   "p3" REAL INDEX OFF,
   "q3" REAL INDEX OFF,
   "s3" REAL INDEX OFF,
   "pf3" REAL INDEX OFF,
   "d3" REAL INDEX OFF,
   "vpeak" REAL INDEX OFF,
   "upeak" REAL INDEX OFF,
   "ipeak" REAL INDEX OFF,
   "ppeak" REAL INDEX OFF,
   "qpeak" REAL INDEX OFF,
   "speak" REAL INDEX OFF,
   "pfpeak" REAL INDEX OFF,
   "dmax" REAL INDEX OFF,
   "vpeak1" REAL INDEX OFF,
   "upeak1" REAL INDEX OFF,
   "ipeak1" REAL INDEX OFF,
   "ppeak1" REAL INDEX OFF,
   "qpeak1" REAL INDEX OFF,
   "speak1" REAL INDEX OFF,
   "pfpeak1" REAL INDEX OFF,
   "dmax1" REAL INDEX OFF,
   "vpeak2" REAL INDEX OFF,
   "upeak2" REAL INDEX OFF,
   "ipeak2" REAL INDEX OFF,
   "ppeak2" REAL INDEX OFF,
   "qpeak2" REAL INDEX OFF,
   "speak2" REAL INDEX OFF,
   "pfpeak2" REAL INDEX OFF,
   "dmax2" REAL INDEX OFF,
   "vpeak3" REAL INDEX OFF,
   "upeak3" REAL INDEX OFF,
   "ipeak3" REAL INDEX OFF,
   "ppeak3" REAL INDEX OFF,
   "qpeak3" REAL INDEX OFF,
   "speak3" REAL INDEX OFF,
   "pfpeak3" REAL INDEX OFF,
   "dmax3" REAL INDEX OFF,
   "id" VARCHAR(40) NOT NULL,
   "roundts" TIMESTAMP WITH TIME ZONE NOT NULL,
   "roundts_month" TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_bin(CAST('P2W' AS interval), "roundts", CAST(1672531200000 AS bigint)) NOT NULL,
   PRIMARY KEY ("roundts", "roundts_month", "id")
)
CLUSTERED BY ("id") INTO 3 SHARDS
PARTITIONED BY ("roundts_month")
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'best_compression',
   column_policy = 'dynamic',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

Many thanks
David

hernanc · January 26, 2023, 11:56am

Hi David,

I think we discussed the PK in the past, perhaps roundts_month could be removed from it?

About using ARRAYs, if you have lots of NULLs at the moment and an ARRAY would replace multiple columns, yes, I think that would bring some space savings but you would need to consider how the queries would need to change and test how they perform.

I hope that helps.

djbestenergy · January 26, 2023, 12:14pm

Hi Hernan,

Yes, I did try removing that but get this back…

SQLParseException[Cannot use non primary key column 'roundts_month' in PARTITIONED BY clause if primary key is set on table]

OK - I’ll give arrays a try as I think we are getting quite a lot of NULLS, although they can be in any of the columns.

hernanc · January 26, 2023, 1:03pm

Hi David,
Apologies I missed the fact that the column needed to be part of the PK to be used for partitioning.
Regarding the use of ARRAYs, I realised there was some ambiguity there.
One possible use of ARRAYs would be to replace columns like d1, d2, and d3 with a single d column defined as an ARRAY.
A different use of ARRAYs would be what is described in Optimizing storage for historic time-series data - Tutorials - CrateDB Community , you may find the approach described in that article interesting if it is something you had not seen before.

hernanc · January 26, 2023, 1:50pm

Just a quick note.
The storage of NULLs is already optimized in CrateDB, so out of these 2 possible uses I mentioned, the one that is interesting to save space is the one in the linked article.

Topic		Replies	Views
Optimizing storage for historic time-series data Tutorials performance , data-storage	10	3477	June 22, 2022
Indexes and storage space CrateDB	5	762	December 15, 2022
Rows need significantly more storage as others in the same table and pretty much the same content SQL	10	871	December 17, 2021
Disproportional space usage CrateDB	4	94	January 13, 2025
Another Indexes question CrateDB	9	525	December 18, 2023

Space savings on data

Related topics