Advice for table/partitioning/sharding strategies

I’m tasked with storing data for a large number of IoT devices (more than 50k), which is ingested in batches.
Multiple time series at different resolutions are associated with each device, although some devices might not have all of them.
Typical series have the following resolutions:

  • minutes (datapoints at intervals of 1 to 10/15 minutes, sometimes irregular)
  • hours: one datapoint per hour (with timestamp at minute 0 of the hour)
  • days: one datapoint per day (with timestamp 00:00 of the day)
  • months: one datapoint per month (with timestamp 00:00 of the first day of the month)
  • years: one datapoint per year (with timestamp 00:00 of the first day of the year)

Obviously, the finer the resolution, the more data we have.
Queries can request ONE timeseries for one or multiple devices, but all the requested series in the query will have the same resolution (e.g., hourly series for devices x and y, or just for a single device).

Typically, all devices have at least the two finest-resolution series (minutes and hours); some devices have more, and some have all series.

If a query requests a series that the requested device does not have, it needs to be synthesized on the fly by aggregating the first available series with a higher (finer) resolution. This is possible because each device has a type that determines exactly which series it has. (As an aside, it would be great to have this aggregation performed automatically by CrateDB, but I don’t know if that’s even possible.)
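
For concreteness, the kind of on-the-fly aggregation I have in mind would look roughly like this (the table and device names are placeholders, and AVG may not be the right aggregate for every metric):

SELECT
    device_id,
    DATE_TRUNC('day', ts) AS ts_day,
    AVG(val) AS val
FROM data_hour                      -- hypothetical per-resolution table
WHERE device_id = 'device-x'        -- placeholder device id
GROUP BY device_id, DATE_TRUNC('day', ts);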

So, back to the topic, it seems logical to me to have at least the following fields in the table(s):

device_id; timestamp; value

(the application also uses the same device_id as the key to access another external table which holds the details of the device)

Now I was wondering whether I should also add a resolution field in there and keep everything in a single table, or create dedicated tables for the various resolutions (or yet something else). I understand that in turn this affects the partitioning and sharding strategies.
I’m trying to understand if performance can suffer by using a single table for everything.

Thanks for any advice.


Are there different retention periods for the different resolutions? If not, you could just keep them in one partitioned table, e.g.

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    val DOUBLE
) 
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);
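
A typical query would then restrict on device and time range, e.g.

SELECT device_id, ts, val
FROM device_data
WHERE device_id = 'some-device'    -- placeholder id
  AND ts >= '2024-01-01'
  AND ts < '2024-02-01';

which should only have to touch the relevant monthly partitions.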

Now I was wondering whether I should also add a resolution field in there and keep everything in a single table, or create dedicated tables for the various resolutions (or yet something else).

Do you need it for some form of aggregation or could you do e.g.

SELECT
    device_id,
    date_trunc('hour', ts) AS "ts_hour",
    AVG(val)
FROM device_data
GROUP BY device_id, date_trunc('hour', ts);

anyway?

I’m trying to understand if performance can suffer by using a single table for everything.

Well, for the more granular data there will probably be no impact.
But data you would store at a monthly or yearly level would definitely be accessed faster if kept in a smaller table. Does it make a real-life difference? Hard to say without knowing your exact use case better.


Well, retention is “forever”, so I guess we can say that the retention periods are the same for all series.

I understand you’re suggesting to partition all series by month, be they per-minute, hourly, daily, monthly or yearly (yearly data always has timestamp 01/01/xxxx, so it would always end up in January partitions, though since those are not many data points it shouldn’t be a problem).

If I keep everything in a single table, I need to be able to pick out series at a given resolution, so an extra field (call it resolution or whatever you like) would be needed in that case, e.g.

SELECT
    device_id,
    ts,
    val
FROM device_data
WHERE resolution = 'hour';    -- I'd obviously use a number here, but for clarity

This is needed since you could have different entries for the same device_id with, e.g., timestamp 01/01/2024: one for the minute series, one for the hourly series, one for the daily series, one for the monthly series and one for the yearly series (the first of January is the most blatant corner case, but you get the point). Similarly, if I need to do aggregations, I still need to pick out the right values for the series I am aggregating on. So with a single table, an extra field indicating the resolution of each entry (i.e. which series it belongs to) would be unavoidable, imho.
On the other hand, if I keep n separate tables (data_minute, data_hour, data_day and so on), each dedicated to series of the corresponding resolution only, then no such field would be needed.
A single extra field might not seem like a big deal, but with billions of entries it might cause some processing overhead (maybe, I don’t know) and extra storage usage (this seems pretty obvious).
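
To make the single-table option concrete, it would look something like this (the resolution column and its numeric encoding are just an example):

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    resolution SMALLINT,    -- e.g. 1 = minute, 2 = hour, 3 = day, ...
    val DOUBLE
)
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);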

So this might go in the direction of keeping separate tables, which, although I didn’t say it earlier, would be my slightly preferred option, everything else being equal.


Monthly was just an initial suggestion, since I don’t know your exact data volumes and expected growth. But yes, the assumption was also that there is much less yearly data than per-minute data.

You could also have different columns :wink:
CrateDB doesn’t really mind sparse data structures.

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    val_minute DOUBLE,
    val_hour DOUBLE,
    ....
) 
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);

… or even keep them in an OBJECT …

CREATE TABLE IF NOT EXISTS device_data (
    device_id TEXT,
    ts TIMESTAMP,
    ts_g GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    val OBJECT AS (
         "minute" DOUBLE,
         "hour" DOUBLE,
       ...  
   )
) 
CLUSTERED BY (device_id)
PARTITIONED BY (ts_g);

That would potentially even make filtering faster.
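
Accessing a single resolution from the OBJECT column would then use subscript notation, something like:

SELECT device_id, ts, val['hour'] AS val_hour
FROM device_data
WHERE ts >= '2024-01-01'
  AND ts < '2024-02-01';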

I would always go with the simpler solution first and not optimize preemptively. Be aware that we have users with multiple 10s to 100s of TiB in single tables.


you might also want to look into this:


This is interesting, but what would the syntax be to extract, for example, the hourly series? I can’t just ask for where val_hour is not NULL, since hourly series might legitimately have NULL values.