I’m tasked with storing data for a number of IoT devices (more than 50k devices), which is ingested in batches.
Multiple time series at different resolutions are associated to each device, although some devices might not have all of them.
Typical series have the following resolutions:
- minutes (datapoints are from 1 to 10/15 minutes intervals, sometimes irregularly)
- hours: one datapoint per hour (with timestamp at minute 0 of the hour)
- days: one datapoint per day (with timestamp 00:00 of the day)
- months: one datapoint per month (with timestamp 00:00 of the first day of the month)
- years: one datapoint per year (with timestamp 00:00 of the first day of the year)
Obviously, the smaller the resolution, the more data we have.
Queries can request ONE timeseries for one or multiple devices, but all the requested series in the query will have the same resolution (eg, hourly series for device x and y, or just for a single device).
Typically, all devices have at least the finer-resolution series (minutes and hours); some devices have more, and some have all series.
If a query requests a series that the wanted device is missing, it needs to be synthesized on the fly by performing aggregation on the first series available that has a higher resolution. This is possible because each device has a type that determines exactly which series it has. (As an aside, it would be great to have this aggregation performed automatically by crateDB, but I don’t know if that’s even possible.)
So, back to the topic, it seems logical to me to have at least the following fields in the table(s):
device_id; timestamp; value
(the application also uses the same device_id
as the key to access another external table which holds the details of the device)
Now I was wondering whether I should also add a resolution
field in there and keep everything in a single table, or create dedicated tables for the various resolutions (or yet something else). I understand that in turn this affects the partitioning and sharding strategies.
I’m trying to understand if performance can suffer by using a single table for everything.
Thanks for any advice.