The implementation described in this article applies to CrateDB versions before 4.7.0. For 4.7.0 and later, a simpler implementation is available in a more recent revision of this article.
When dealing with time series data, you will likely at some point run into the question of how to deal with incomplete data. This could be caused by a sensor outage, network issues, or any other factors resulting in missing data. In this article, we will discuss two strategies to interpolate such missing values.
Data Model
The data model used hereafter is based on our Generate Time Series Data article, which captures the location of the International Space Station and ingests it into a simple target table:
CREATE TABLE iss (
  timestamp TIMESTAMP GENERATED ALWAYS AS CURRENT_TIMESTAMP,
  position GEO_POINT
);
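If the live ingest from that article is not at hand, a test record can also be inserted manually; the coordinates below are arbitrary placeholders:
-- "timestamp" is filled in automatically by the generated column;
-- the coordinates are placeholders, not a real ISS position
INSERT INTO iss (position) VALUES ([144.0427, 22.7383]);
-- make the record visible to subsequent queries
REFRESH TABLE iss;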
On top of this table, we create a view that calculates the distance between the ISS's position and an arbitrary reference location. It also resamples the data to one record per minute by keeping the first reading of each minute:
create view doc.raw as
select "time",
       distance
from (
  select date_trunc('minute', i."timestamp") as "time",
         -- keep the first reading of each minute
         row_number() over (
           partition by date_trunc('minute', i."timestamp")
           order by i."timestamp" asc
         ) as "row",
         distance(i."position", [144.0427, 22.7383]) / 1000 as distance
  from doc.iss i
) x
where x."row" = 1;
The view returns one row per minute for which data was captured. Note the gap between 10:11:00 and 12:55:00 as well as the one between 12:55:00 and 15:16:00.
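Such gaps can also be spotted programmatically. A minimal sketch using the lag window function to flag rows whose predecessor lies more than one minute in the past:
select "time", previous_time
from (
  select "time",
         lag("time") over (order by "time" asc) as previous_time
  from doc.raw
) t
where previous_time < "time" - '1 minute'::INTERVAL;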
The basis of the following interpolation strategies is a query utilizing generate_series to generate a continuous time axis with a sample rate of one minute, combined with a left join to include existing values. Minutes without a matching record in doc.raw yield a null distance:
select s."time",
       r.distance
from generate_series(
       date_trunc('minute', '2021-04-26 06:00:00'),
       date_trunc('minute', '2021-04-26 12:00:00'),
       '1 minute'::INTERVAL
     ) s("time")
left join doc.raw r using ("time");
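If generate_series is unfamiliar, it can also be run on its own to see the time axis it builds; every minute between the two boundaries becomes one row (the range is shortened here for brevity):
-- the series alone: one row per minute between the two boundaries
select *
from generate_series(
  '2021-04-26 06:00:00'::TIMESTAMP,
  '2021-04-26 06:05:00'::TIMESTAMP,
  '1 minute'::INTERVAL
) s("time");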
Last Observation Carried Forward
A strategy to fill those two gaps is to take the last observed value and carry it forward.
We extend the above query to include a sum which orders the values by time and increases by 1 whenever a non-null value is observed. Every row of our time axis now carries the number of the last observed value. This way, the result set is partitioned so that null values and the last observed value share the same grp value.
In the outer query, we make use of this partitioning and select the value from each partition. We use the max aggregation function below, but since there is only one non-null value per partition, no actual selection of a maximum takes place; max, min, avg, etc. all yield the same result.
select "time",
       distance,
       max(distance) over (partition by grp) as new_distance
from (
  select s."time",
         distance,
         sum(if(distance is null, 0, 1)) over (order by s."time" asc) as grp
  from generate_series(
         date_trunc('minute', '2021-04-26 06:00:00'),
         date_trunc('minute', '2021-04-26 12:00:00'),
         '1 minute'::INTERVAL
       ) s("time")
  left join doc.raw r using ("time")
) t
order by t."time" asc;
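To make the grouping tangible, here is the same trick applied to a handful of synthetic values (made up for illustration, not ISS data):
-- grp only increases when a non-null value appears, so each null row
-- shares its grp with the most recent observation: 1, 1, 1, 2, 2
select col1 as "time",
       col2 as distance,
       sum(if(col2 is null, 0, 1)) over (order by col1 asc) as grp
from unnest([1, 2, 3, 4, 5], [10.0, null, null, 20.0, null]);
Applying max(distance) over (partition by grp) then fills rows 2 and 3 with 10.0 and row 5 with 20.0.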
Next Observation Carried Backward
Carrying a value backward is a slight variation of the above query. Instead of counting non-null values by time ascending, we reverse the order to descending:
select "time",
       distance,
       max(distance) over (partition by grp) as new_distance
from (
  select s."time",
         distance,
         sum(if(distance is null, 0, 1)) over (order by s."time" desc) as grp
  from generate_series(
         date_trunc('minute', '2021-04-26 06:00:00'),
         date_trunc('minute', '2021-04-26 12:00:00'),
         '1 minute'::INTERVAL
       ) s("time")
  left join doc.raw r using ("time")
) t
order by t."time" asc;
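Run against the same synthetic values as before, the descending sum ties each null to the next observation instead; a trailing null has no successor and therefore stays null:
-- ordered descending, the nulls in rows 2 and 3 share grp 1 with the
-- 20.0 in row 4; the trailing null in row 5 forms its own group (0)
select col1 as "time",
       col2 as distance,
       sum(if(col2 is null, 0, 1)) over (order by col1 desc) as grp
from unnest([1, 2, 3, 4, 5], [10.0, null, null, 20.0, null]);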
Further Variations
This pattern can be modified further to implement custom strategies. The variation below calculates the average value over a sliding window of the 100 preceding rows and carries it forward:
select s."time",
       distance,
       avg(distance) filter (where distance is not null)
         over (order by s."time" asc rows between 100 preceding and current row) as preceding_avg
from generate_series(
       date_trunc('minute', '2021-04-26 06:00:00'),
       date_trunc('minute', '2021-04-26 12:00:00'),
       '1 minute'::INTERVAL
     ) s("time")
left join doc.raw r using ("time")
order by 1 asc;
The example above is limited in that the window considered has a fixed size (100 rows). If there are more than 100 consecutive null values, the average becomes null as well. To overcome this limitation, we can combine the calculation of the average with the Last Observation Carried Forward pattern:
select "time",
       distance,
       max(new_distance_avg) over (partition by grp) as new_distance_avg
from (
  select t."time",
         t.distance,
         t.grp,
         -- the last null value of a gap and the values observed after it
         -- share the same grp_reverse; the null row must not receive
         -- their average, so it is excluded here
         if(lag(grp) over (order by "time" asc) = grp and distance is null,
            null,
            avg(distance) over (partition by grp_reverse)) as new_distance_avg
  from (
    select s."time",
           distance,
           sum(if(distance is null, 0, 1)) over (order by s."time" asc) as grp,
           sum(if(distance is null, 1, 0)) over (order by s."time" asc) as grp_reverse
    from generate_series(
           date_trunc('minute', '2021-04-26 06:00:00'),
           date_trunc('minute', '2021-04-26 12:00:00'),
           '1 minute'::INTERVAL
         ) s("time")
    left join doc.raw r using ("time")
  ) t
) t
order by t."time" asc;
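If the gap-filled series should be persisted rather than computed on the fly, any of the above queries can feed an INSERT. A minimal sketch, using the Last Observation Carried Forward query and a hypothetical target table doc.raw_filled:
-- hypothetical target table for the interpolated series
CREATE TABLE doc.raw_filled (
  "time" TIMESTAMP,
  distance DOUBLE PRECISION
);

INSERT INTO doc.raw_filled ("time", distance) (
  select "time",
         max(distance) over (partition by grp) as distance
  from (
    select s."time",
           distance,
           sum(if(distance is null, 0, 1)) over (order by s."time" asc) as grp
    from generate_series(
           date_trunc('minute', '2021-04-26 06:00:00'),
           date_trunc('minute', '2021-04-26 12:00:00'),
           '1 minute'::INTERVAL
         ) s("time")
    left join doc.raw r using ("time")
  ) t
);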