What would be the basic optimization techniques/parameters for a time series table in this example?

Hello,

I’m aware of Optimizing storage for historic time-series data, but in this specific question, I’m not interested data retention or moving old (cold) data to another table, etc.

I want to understand better what kind of basic optimizations/techniques should be used when creating a time series table with the following structure, stored on a 3-node CrateDB cluster, each node with 64 GB RAM , 16 CPU cores and 500 GB storage:

CREATE TABLE IF NOT EXISTS emre.test_table_001 (
  time timestamptz NOT NULL,
  location_id text NOT NULL,
  device_id text NOT NULL, 
  sensor_reading_01 double precision,
  sensor_reading_02 double precision,
  ...
  sensor_reading_50 double precision,
);

Combination of (time, location_id, device_id) is unique .

Let’s assume that:

  • this table stores sensor readings from various devices identified by device_id , residing at a geographical location identified by a location_id .
  • the sensor data is sent to a .NET application, and this .NET application INSERTs data into this table every few seconds.
  • other .NET applications SELECT from this table to show various dashboards to users, whenever users request these via a web application.
  • a user is generally interested in the time series data for 1 or more devices in a single location in a given web page request.
  • currently there are about ~80 devices scattered to ~50 different locations, sending sensor readings every second.
  • hot data ’ should be available for about 3 months, meaning, currently hot data is about 100 million rows , but as new locations and devices are added, the amount of ‘ hot data ’ is estimated to be on the order a few billion rows .

Based on these assumptions and the ‘naive’ table CREATE command above, what would CrateDB experts say, with respect to the following topics and table parameters:

  • Typically one should aim for a single shard size of about 10-50 GiB
  • Partitions can be deleted/dropped like a table. i.e. if you want to delete old data quickly, you should partition by e.g. month
  • typically having less shards is more performant, than having too many

@jayeff wrote a basic guide for time series data

also some core information on sharding:

2 Likes