Advice for table/partitioning/sharding strategies

proddata · June 3, 2024, 3:49pm

Monthly was just an initial suggestion, made without knowing the exact data volumes and expected growth. But yes, the assumption was also that the yearly data is much smaller than the per-minute data.

You could also use a separate column for each resolution.
CrateDB doesn't really mind sparse data structures, so columns that stay empty for most rows are not a problem.

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    val_minute DOUBLE,
    val_hour DOUBLE,
    ....
) 
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);
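A minimal sketch of how the sparse columns could be used, assuming the table above exists with at least the columns shown. The device ID, timestamps, and values are made up for illustration:

```sql
-- Per-minute reading: only val_minute is set, val_hour stays NULL.
INSERT INTO device_data (device_id, ts, val_minute)
VALUES ('dev-1', '2024-06-03T15:49:00', 21.4);

-- Hourly reading for the same device: only val_hour is set.
INSERT INTO device_data (device_id, ts, val_hour)
VALUES ('dev-1', '2024-06-03T15:00:00', 21.1);

-- Make the new rows visible to the query below right away.
REFRESH TABLE device_data;

-- Read only the hourly values for June 2024.
-- Filtering on ts_g restricts the query to the matching monthly partition.
SELECT ts, val_hour
FROM device_data
WHERE device_id = 'dev-1'
  AND ts_g = '2024-06-01T00:00:00'
  AND val_hour IS NOT NULL
ORDER BY ts;
```

Each row only fills the column for its own resolution, and the `IS NOT NULL` filter picks out the rows of one resolution.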

… or even keep them in an OBJECT …

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    val OBJECT AS (
        "minute" DOUBLE,
        "hour" DOUBLE,
        ...
    )
)
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);
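With the OBJECT variant, the sub-columns are written with an object literal and read with subscript notation. Again just a sketch, assuming the table above and made-up values:

```sql
-- Per-minute reading: only the "minute" key of the object is set.
INSERT INTO device_data (device_id, ts, val)
VALUES ('dev-1', '2024-06-03T15:49:00', {"minute" = 21.4});

-- Hourly reading: only the "hour" key is set.
INSERT INTO device_data (device_id, ts, val)
VALUES ('dev-1', '2024-06-03T15:00:00', {"hour" = 21.1});

-- Make the new rows visible to the query below right away.
REFRESH TABLE device_data;

-- Sub-columns are addressed with subscript notation.
SELECT ts, val['hour'] AS val_hour
FROM device_data
WHERE device_id = 'dev-1'
  AND val['hour'] IS NOT NULL
ORDER BY ts;
```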

That would potentially even make filtering faster.

I would always go with the simpler solution first and not optimize preemptively. Be aware that we have users with tens to hundreds of TiB in single tables.
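If you want to check later whether the partition granularity still fits, one option is to look at the actual partition sizes first. A rough sketch, assuming the table is called `device_data` as above:

```sql
-- Document count and on-disk size per partition, primary shards only.
SELECT partition_ident,
       sum(num_docs) AS docs,
       sum(size) / 1024.0 / 1024.0 / 1024.0 AS size_gib
FROM sys.shards
WHERE table_name = 'device_data'
  AND "primary" = true
GROUP BY partition_ident
ORDER BY partition_ident;
```

If the partitions turn out to be very small or very large, that would be the point to revisit monthly vs. some other interval.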

You might also want to look into this:

Topic	Category and tags	Replies	Views	Date
Sharding and partitioning guide for time-series data	Tutorials (sql, fundamentals, getting-started, performance)	0	4902	July 2, 2021
Starting out with cratedb	CrateDB	5	1025	May 31, 2022
CrateDB partitioned table vs. TimescaleDB Hypertable	CrateDB (fundamentals)	4	384	February 1, 2024
Optimizing storage for historic time-series data	Tutorials (performance, data-storage)	10	3355	June 22, 2022
Partioning question	CrateDB	3	603	January 16, 2023
