LTTB downsampling to work around Grafana SQL limits

Hi,

still trying to get me feet wet with CrateDB as a Prometheus long term storage back-end. As I was running into timeout problems when querying via Prometheus I switched to “Grafana SQL” to interact directly with CrateDB.

But given that there are 1.5B records in the metrics table after just 2 months, even limiting to very few instances I constantly run into SQL query limits due to way too many rows being returned.

I found Hernan’s nice tutorial quick bite, loaded the user defined functions into our 3 node test cluster, but I am non the wiser, probably due to wrong function prototypes:

crash yields

cr> SELECT routine_schema, specific_name
    FROM information_schema.routines
    WHERE routine_name='lttb_with_parallel_arrays';
+----------------+---------------------------------------------------------------+
| routine_schema | specific_name                                                 |
+----------------+---------------------------------------------------------------+
| doc            | lttb_with_parallel_arrays(integer_array, real_array, integer) |
+----------------+---------------------------------------------------------------+
SELECT 1 row in set (0.011 sec)

preliminary queries seem to work just fine:

SELECT count(timestamp) FROM metrics WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z' AND labels['inst
    ance'] = 'a5602' ;
+------------------+
| count(timestamp) |
+------------------+
|            29607 |
+------------------+
SELECT 1 row in set (2.948 sec)

SELECT count(value) FROM metrics WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z' AND labels['instance
    '] = 'a5602' ;
+--------------+
| count(value) |
+--------------+
|        29607 |
+--------------+
SELECT 1 row in set (0.049 sec)

but when I fire up the down sampling query it stops short of not finding a proper function (with or without prefixing doc.:

SELECT lttb_with_parallel_arrays(array(SELECT timestamp FROM metrics WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20
    T05:46:07.929Z' AND labels['instance'] = 'a5602' ORDER BY timestamp), array(select value from metrics WHERE timestamp BETWEEN '2023-09-
    19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z' AND labels['instance'] = 'a5602' order by timestamp),100);
UnsupportedFunctionException[Unknown function: lttb_with_parallel_arrays((SELECT timestamp FROM (doc.metrics)), (SELECT value FROM (doc.metrics)), 100), no overload found for matching argument types: (timestamp without time zone_array, double precision_array, integer). Possible candidates: doc.lttb_with_parallel_arrays(array(integer), array(real), integer):object]

io.crate.exceptions.UnsupportedFunctionException: Unknown function: lttb_with_parallel_arrays((SELECT timestamp FROM (doc.metrics)), (SELECT value FROM (doc.metrics)), 100), no overload found for matching argument types: (timestamp without time zone_array, double precision_array, integer). Possible candidates: doc.lttb_with_parallel_arrays(array(integer), array(real), integer):object
        at io.crate.metadata.Functions.raiseUnknownFunction(Functions.java:357)
[...]

I think I somehow need to convince CrateDB/cast the arrays into integer/real one’s but how?

Thanks a lot in advance

Carsten

d’oh

even while seeing

"
In this case we are working with an x-axis with integers, but this approach works equally well using timestamps, just changing the script where it says ARRAY(INTEGER) to ARRAY(TIMESTAMP WITH TIME ZONE).
"

I probably just messed up copying/pasting this into CrateDB.

Perhaps unsurprisingly, it starts working with the correct prototype
CREATE OR REPLACE FUNCTION lttb_with_parallel_arrays (xarray ARRAY(TIMESTAMP WITHOUT TIME ZONE),yarray ARRAY(DOUBLE), threshold int)

I’ll continue investigating of results make sense - sorry for the noise!

Dear Carsten,

Thanks for the reminder. We will need to wrap up our improvements there and finally ship a new release including them.

CrateDB gained server-side cursor support the other day, to handle exactly this use case of needing to send huge amounts of result data to the client, by enabling it to actively page through the results. However, this technique is probably not applicable for visualization/monitoring/analytics applications and use cases like/with Prometheus and Grafana.

For those scenarios, downsampling the data on demand, to reduce the amount of data submitted to Grafana, is the right choice.

Excellent! We will need to ask the database team whether it would be feasible to ship an LTTB function integrated with CrateDB. Do we already have a ticket for that, @hernanc?

Good to hear you figured it out how to apply this recipe. We will be happy to hear about the outcome in your scenario. Please don’t stop asking questions, or telling us about your progress, anytime.

With kind regards,
Andreas.

Just to add one detail: It does not necessarily need to be LTTB, which is already an advanced method I guess?

The basic Grafana query example over here just uses the arithmetic mean. However, most importantly, it demonstrates how to bring Grafana’s time range filter into the query, so that it will only request the data you are intending to retrieve, on behalf of what has been selected on its time range chooser [1].

@hammerhead helped me to discover the right SQL statement on this matter, thank you again.

P.S.: Sure enough, I also intend to use LTTB over there on behalf of a subsequent iteration, to level up a bit.


  1. When not using that, your systems will probably explode, due to too much data being shuffled around for nothing. ↩︎

Very interesting! And no, I’m not set on lttb - I just happen to have found it in this context.

In principle, anything, which helps making long range Grafana queries manageable, incorporating it’s wanted resolution (based on time range and displayed resolution) should be fine for now.

Thanks a bunch!

1 Like

All right. Then, I suggest to start with the basic example, using arithmetic mean, based on the query expression I’ve shared above, in order to check if that will not kill your browser, or other upstream systems ;]. Considering it will make you happy instantly, which I dearly hope, you can level up from that gradually.

Hi everyone,

complete newbie here. I am using CrateDB to log power consumption and photovoltaic data and want to speed up some Grafana panels. The power consumption is logged by a Shelly 3em, which gives a data point for all three electrial phases roughly every 3 seconds, so some data builds up over the years. Other readings come in from e.g. the solar inverter.
I am using an old laptop as a server and see slow Grafana panels e.g. when viewing on mobile phones.

I am looking to implement the averaging based on the LTTB method and already added the UDF with the modification to use “TIMESTAMP WITH TIME ZONE”.

Now running a query on CrateDB console looks successfull:

WITH downsampleddata AS 
	 (	SELECT lttb_with_parallel_arrays(	
			array(SELECT time FROM power_consumption ORDER BY time),							
			array(SELECT phase_0_momentary FROM power_consumption ORDER BY time)
			,100) AS lttb)
	SELECT unnest(lttb['0']) as time,unnest(lttb['1']) AS phase0
	FROM downsampleddata;

returning

time	phase0

1708718549332 10.4399995803833
1708719821353 11.59000015258789
...

but I am facing problems getting this over to Grafana (10.3.3).
My query is the same as above, but I am receiving

converting time columns failed: failed to convert time column: column type “*string” is not convertible to time.Time

This might be the wrong spot here to ask, but may be there is a simple solution to this? I am thinking in the direction of CASTing, but I am even unsure where to place a possible type conversion…

Thanks for any help, I think it might help others get along as well,

Joost

1 Like

Ok, so a bit of digging further and trial-and-error I arrived at this:

WITH downsampleddata AS 
	 (	SELECT lttb_with_parallel_arrays(	
			array(SELECT time FROM power_consumption WHERE $__timeFilter(time) GROUP BY time ORDER BY time),							
			array(SELECT phase_0_momentary FROM power_consumption WHERE $__timeFilter(time) ORDER BY time)
			,5000) AS lttb )
	SELECT date_format(unnest(lttb['0']))::timestamp with time zone as time, unnest(lttb['1'])::REAL AS phase0
	FROM downsampleddata;

I’ll gladly receive any comments for optimisation or bug correction…

I imagine the number “5000” as parameter defines the returned number of data points (?). Now I’m off to figure out how to get the optimal value for this from Grafana depending on panel width/screen resolution on different devices.

1 Like

Hi,
Just a few comments.
5000 defines indeed the number of returned data points.
I think the GROUP BY time should not be necessary.
Also in date_format(unnest(lttb['0']))::timestamp the double casting is probably redundant, I think you could remove the call to date_format.
I hope this helps.
Thank you.

1 Like

Hi,

thanks a lot, adapted the code and it works fine!

Just on a side note, I’m curious what your opinion on potential perfomance impact using those JS functions is.
Might there be an upper quantity limit of data points which can be successfully downsampled by this function?

Thanks again and best regards,

Joost

1 Like

Hi,
The function receives all rows in input at once, so there will be a limit there around the amount of heap space available, but you could work around this by breaking your data into smaller chunks and running the function on each bucket, for instance instead of trying to run the function to downsample one day of data to 5,000 data point we could instead downsample each hour to 200 data points.

1 Like