Resampling time-series data with DATE_BIN

Introduction

CrateDB 4.7 adds a new DATE_BIN function, offering greater flexibility for grouping rows into time buckets. This article shows how to use DATE_BIN to place rows into time buckets and resample their values.

In previous versions, DATE_TRUNC could be used for similar requirements, but it has its limitations. With DATE_TRUNC, timestamps can be truncated to fixed precisions such as second, minute, or hour. It is not possible, however, to create time buckets based on arbitrary intervals such as 10 seconds or 5 minutes.
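To make the difference concrete, here is a minimal sketch using the sensor_readings table introduced in the next section. DATE_TRUNC only accepts a fixed precision, while DATE_BIN accepts an arbitrary interval plus an origin timestamp (0, i.e. the epoch, in this sketch):

-- DATE_TRUNC: truncates to a fixed precision, e.g. full minutes
SELECT DATE_TRUNC('minute', "time") AS ts_minute
FROM doc.sensor_readings;

-- DATE_BIN: aligns timestamps to buckets of an arbitrary width,
-- here 5 minutes, relative to the origin timestamp 0
SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin
FROM doc.sensor_readings;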

Data model

We assume a simple time-series table as a basis:

CREATE TABLE sensor_readings (
   time TIMESTAMP WITH TIME ZONE NOT NULL,
   sensor_id INTEGER NOT NULL,
   battery_level DOUBLE PRECISION,
   battery_status TEXT,
   battery_temperature DOUBLE PRECISION
);

The table is filled with a number of rows that do not follow a fixed sampling rate (i.e. readings do not arrive consistently, such as once every minute).
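For illustration only, a handful of hypothetical rows with irregular spacing could be inserted like this (the timestamps and values are made up):

-- Hypothetical sample rows; note the irregular gaps between readings
INSERT INTO sensor_readings (time, sensor_id, battery_level, battery_status, battery_temperature)
VALUES ('2022-01-01T10:00:17+00', 1, 98.2, 'FULL', 21.3),
       ('2022-01-01T10:03:41+00', 1, 97.9, 'FULL', 21.4),
       ('2022-01-01T10:11:05+00', 1, 97.1, 'DISCHARGING', 21.8);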

Time-bucketing and resampling

As measurements in our table occur at irregular intervals, we now want to resample them into time buckets of five minutes. For each time bucket, one row should be returned. Different approaches can achieve this downsampling; we will use the latest row in each time bucket as its representative, but other approaches (such as calculating the average) are also possible.

This leads to the following query, which:

  1. Uses DATE_BIN to generate time buckets of 5 minutes each
  2. Assigns every row within a time bucket a ROW_NUMBER, ordered by timestamp descending, so that the latest row in each bucket gets the value 1
  3. Wraps everything in a subquery so that the row_number column can be referenced and filtered in the WHERE clause

SELECT ts_bin,
       battery_level,
       battery_status,
       battery_temperature
FROM (
  SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
         battery_level,
         battery_status,
         battery_temperature,
         ROW_NUMBER() OVER (PARTITION BY DATE_BIN('5 minutes'::INTERVAL, "time", 0) ORDER BY "time" DESC) AS "row_number"
  FROM doc.sensor_readings
) x
WHERE "row_number" = 1
ORDER BY 1 ASC

The output shows that three time buckets have been created based on the table’s rows.
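As noted above, picking the latest row per bucket is only one possible strategy. A minimal sketch of an aggregation-based alternative, assuming that averaging the numeric columns is acceptable for your use case:

-- Alternative: average the numeric values within each 5-minute bucket
SELECT DATE_BIN('5 minutes'::INTERVAL, "time", 0) AS ts_bin,
       AVG(battery_level) AS avg_battery_level,
       AVG(battery_temperature) AS avg_battery_temperature
FROM doc.sensor_readings
GROUP BY DATE_BIN('5 minutes'::INTERVAL, "time", 0)
ORDER BY 1 ASC;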

Conclusion

With the addition of DATE_BIN in CrateDB 4.7, highly flexible intervals can now be used to define time buckets. DATE_TRUNC remains a simpler alternative where fixed precisions are sufficient.

Want to learn more about working with time series data in CrateDB? Our Advanced Time Series course is a free, online course that will teach you all you need to know. Earn a certificate of completion to share on your LinkedIn profile! Enroll at learn.cratedb.com.


@hammerhead, thank you for bringing up this topic. :star_struck:

Let’s assume :crazy_face: I have another data model approach which is as follows:


I would like to perform fast aggregations and resampling on each of the sensors. I came up with this design because the sensors in my case are very dynamic, and I didn’t want to fill up my database with thousands of columns that contain no data :face_with_spiral_eyes:. However, I am now facing some real performance bottlenecks when it comes to resampling and WHERE conditions.

How would you approach resampling on such a design? Additionally, is this design bad in general for time-series databases like CrateDB?

Regards, Schabi

Hi @SchabiDesigns,

Can you share a few details on how your table is set up with regard to sharding/partitioning, as well as the data volume? The output of the following two queries would be interesting to see:

SHOW CREATE TABLE your_schema_name.your_table_name;

SELECT id, num_docs, size
FROM sys.shards
WHERE primary = TRUE
   AND schema_name = 'your_schema_name'
   AND table_name = 'your_table_name'

And what does the resampling query look like that shows slow performance?


Hi Niklas @hammerhead
Sure thing… thanks for your time! :star_struck:

CREATE TABLE IF NOT EXISTS "my_schema"."dataset" (
   "experiment_id" BIGINT,
   "sensor_short_name" VARCHAR(20),
   "date_time" TIMESTAMP WITH TIME ZONE,
   "quantity" DOUBLE PRECISION,
   "partition_id" SMALLINT
)
CLUSTERED INTO 16 SHARDS
PARTITIONED BY ("partition_id")
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

The other query looks a bit weird to me… seems like there are some leftovers from my previous clearing action :upside_down_face:. The partition_id makes sure partitions are around 160 GB each.

Anyway, I think the main problem comes from the query I have to perform to get the data I need…

in the easiest case
SELECT * FROM my_schema.dataset WHERE experiment_id IN {ids}
→ up to 1000 ids :roll_eyes:

in the worst case

SELECT a.* FROM (
  SELECT experiment_id, sensor_short_name, date_time, quantity FROM my_schema.dataset 
  WHERE experiment_id IN {ids}
) AS a INNER JOIN (
  SELECT experiment_id, date_time FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY experiment_id, quantity) AS row_num FROM my_schema.dataset
    WHERE sensor_short_name = 'tcy' AND experiment_id IN {ids}
  ) AS cycles
  WHERE cycles.row_num % {every} = 0
) AS b
ON (a.experiment_id = b.experiment_id AND a.date_time = b.date_time);

In this case I take only some of the data, depending on the {every} variable. I would rather resample like you did…

→ I would like to filter by experiment_id and then resample across all the available sensors. No idea how to achieve this in CrateDB itself… in Python, sure, but that means a lot of data loading in the first place and a very bad user experience :nauseated_face:

Things get very nasty with this kind of table design; I thought it would be clever this way because of the very flexible data formats :face_with_spiral_eyes:

Regards
Schabi

Hi @SchabiDesigns,

The partitioning strategy is probably not ideal for those types of queries, mainly because partition_id is not used as a WHERE condition in them. As a result, CrateDB needs to go through the entire table (all partitions), even if all relevant rows are in just one partition, and the queries will get slower as the number of partitions grows.
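For example, a sketch of the easiest-case query with the partition column included (the literal values are placeholders, assuming the relevant partition_id is known up front):

-- Filtering on the partition column lets CrateDB prune partitions
SELECT experiment_id, sensor_short_name, date_time, quantity
FROM my_schema.dataset
WHERE partition_id = 3               -- hypothetical partition value
  AND experiment_id IN (101, 102);   -- hypothetical experiment ids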

Without knowing the details of your data model, do you see an option to partition by another column that you can filter by in your queries? experiment_id may or may not be an option, depending on how many experiments you store in the table (you should not exceed roughly 1000 partitions per table) and how data volumes differ between experiments (partitions should be roughly equal in size).

For the approach with ROW_NUMBER() and downsampling using the modulo operator, you may also find the built-in _docid column useful. A condition such as WHERE _docid % 10 = 0 returns roughly every 10th document without having to compute row numbers. However, it is slightly different from your approach, where you generate the row numbers after filtering for a set of experiment_ids and with an ORDER BY: the _docid numbering is assigned internally at ingestion time and does not reflect any filtering or ordering in the query.
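A minimal sketch of that idea, reusing hypothetical ids as placeholders:

-- Keep roughly every 10th stored document instead of computing row numbers
SELECT experiment_id, sensor_short_name, date_time, quantity
FROM my_schema.dataset
WHERE _docid % 10 = 0
  AND experiment_id IN (101, 102);   -- hypothetical experiment ids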


Thanks for your support @hammerhead

Yes, I could include the partition_id in the query. Thanks for that!

I don’t see experiment_id as an option for partitioning due to the unknown data size. One experiment can be just a single value, while another can be 1 GB of data. The number of sensors per experiment also varies, from 1 to 50 or more :roll_eyes:
I will try your recommendation to use the _docid approach and check if that helps a bit :+1:

Maybe a general question: how would you design a table to store such variable data and still be able to query it fast?