LTTB downsampling to work around Grafana SQL limits

Hi,

still trying to get my feet wet with CrateDB as a Prometheus long-term storage back-end. As I was running into timeout problems when querying via Prometheus, I switched to “Grafana SQL” to interact directly with CrateDB.

But given that there are 1.5B records in the metrics table after just 2 months, even when limiting to very few instances I constantly run into SQL query limits because way too many rows are returned.

I found Hernan’s nice tutorial quick bite, loaded the user-defined functions into our 3-node test cluster, but I am none the wiser, probably due to wrong function prototypes:

crash yields

cr> SELECT routine_schema, specific_name
    FROM information_schema.routines
    WHERE routine_name='lttb_with_parallel_arrays';
+----------------+---------------------------------------------------------------+
| routine_schema | specific_name                                                 |
+----------------+---------------------------------------------------------------+
| doc            | lttb_with_parallel_arrays(integer_array, real_array, integer) |
+----------------+---------------------------------------------------------------+
SELECT 1 row in set (0.011 sec)

preliminary queries seem to work just fine:

SELECT count(timestamp)
FROM metrics
WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
  AND labels['instance'] = 'a5602';
+------------------+
| count(timestamp) |
+------------------+
|            29607 |
+------------------+
SELECT 1 row in set (2.948 sec)

SELECT count(value)
FROM metrics
WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
  AND labels['instance'] = 'a5602';
+--------------+
| count(value) |
+--------------+
|        29607 |
+--------------+
SELECT 1 row in set (0.049 sec)

but when I fire up the downsampling query, it fails to find a matching function (with or without the doc. prefix):

SELECT lttb_with_parallel_arrays(
    array(SELECT timestamp FROM metrics
          WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
            AND labels['instance'] = 'a5602'
          ORDER BY timestamp),
    array(SELECT value FROM metrics
          WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
            AND labels['instance'] = 'a5602'
          ORDER BY timestamp),
    100);
UnsupportedFunctionException[Unknown function: lttb_with_parallel_arrays((SELECT timestamp FROM (doc.metrics)), (SELECT value FROM (doc.metrics)), 100), no overload found for matching argument types: (timestamp without time zone_array, double precision_array, integer). Possible candidates: doc.lttb_with_parallel_arrays(array(integer), array(real), integer):object]

io.crate.exceptions.UnsupportedFunctionException: Unknown function: lttb_with_parallel_arrays((SELECT timestamp FROM (doc.metrics)), (SELECT value FROM (doc.metrics)), 100), no overload found for matching argument types: (timestamp without time zone_array, double precision_array, integer). Possible candidates: doc.lttb_with_parallel_arrays(array(integer), array(real), integer):object
        at io.crate.metadata.Functions.raiseUnknownFunction(Functions.java:357)
[...]

I think I somehow need to convince CrateDB to cast the arrays into integer/real ones, but how?

Thanks a lot in advance

Carsten

d’oh

even while seeing

"
In this case we are working with an x-axis with integers, but this approach works equally well using timestamps, just changing the script where it says ARRAY(INTEGER) to ARRAY(TIMESTAMP WITH TIME ZONE).
"

I probably just messed up copying/pasting this into CrateDB.

Perhaps unsurprisingly, it starts working with the correct prototype:

CREATE OR REPLACE FUNCTION lttb_with_parallel_arrays(
    xarray ARRAY(TIMESTAMP WITHOUT TIME ZONE),
    yarray ARRAY(DOUBLE),
    threshold INT
)
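With that signature in place, the downsampling query from above goes through unchanged, i.e. something like (same query as before, just with the doc. prefix for clarity):

SELECT doc.lttb_with_parallel_arrays(
    array(SELECT timestamp FROM metrics
          WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
            AND labels['instance'] = 'a5602'
          ORDER BY timestamp),
    array(SELECT value FROM metrics
          WHERE timestamp BETWEEN '2023-09-19T05:46:07.927Z' AND '2023-09-20T05:46:07.929Z'
            AND labels['instance'] = 'a5602'
          ORDER BY timestamp),
    100) AS downsampled;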

I’ll continue investigating whether the results make sense - sorry for the noise!

Dear Carsten,

Thanks for the reminder. We will need to wrap up our improvements there and finally ship a new release including them.

CrateDB gained server-side cursor support the other day, to handle exactly this use case of needing to send huge amounts of result data to the client, by enabling it to actively page through the results. However, this technique is probably not applicable to visualization/monitoring/analytics applications and use cases like/with Prometheus and Grafana.
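For the record, a minimal sketch of how such cursor-based paging could look, using the table and column names from this thread (the cursor syntax follows the PostgreSQL model; please check the CrateDB documentation for the exact options):

-- sketch: paging through a large result set with a server-side cursor
-- WITH HOLD keeps the cursor usable outside an explicit transaction
DECLARE metrics_cursor CURSOR WITH HOLD FOR
    SELECT timestamp, value
    FROM metrics
    WHERE labels['instance'] = 'a5602'
    ORDER BY timestamp;
FETCH FORWARD 10000 FROM metrics_cursor;  -- repeat until an empty result comes back
CLOSE metrics_cursor;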

For those scenarios, downsampling the data on demand, to reduce the amount of data submitted to Grafana, is the right choice.

Excellent! We will need to ask the database team whether it would be feasible to ship an LTTB function integrated with CrateDB. Do we already have a ticket for that, @hernanc?

Good to hear you figured out how to apply this recipe. We will be happy to hear about the outcome in your scenario. Please don’t stop asking questions, or telling us about your progress, anytime.

With kind regards,
Andreas.

Just to add one detail: it does not necessarily need to be LTTB, which is already a rather advanced method, I guess?

The basic Grafana query example over here just uses the arithmetic mean. Most importantly, however, it demonstrates how to bring Grafana’s time range filter into the query, so that it only requests the data you actually intend to retrieve, based on what has been selected in its time range picker [1].
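For illustration, such a query can look roughly like this, using the $__timeFilter and $__timeGroupAlias macros of Grafana’s PostgreSQL data source (a sketch with the table and column names from this thread; the actual statement is in the example linked above):

-- sketch: arithmetic-mean downsampling, constrained by Grafana's time range
-- $__timeFilter() and $__timeGroupAlias() are expanded by Grafana before
-- the query is sent to CrateDB
SELECT
    $__timeGroupAlias("timestamp", $__interval),
    avg(value) AS value_avg
FROM metrics
WHERE $__timeFilter("timestamp")
  AND labels['instance'] = 'a5602'
GROUP BY 1
ORDER BY 1;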

@hammerhead helped me to discover the right SQL statement on this matter, thank you again.

P.S.: Sure enough, I also intend to use LTTB over there in a subsequent iteration, to level up a bit.


  1. When not using that, your systems will probably explode, due to too much data being shuffled around for nothing. ↩︎

Very interesting! And no, I’m not set on LTTB - I just happen to have found it in this context.

In principle, anything that helps make long-range Grafana queries manageable, incorporating the desired resolution (based on time range and display resolution), should be fine for now.

Thanks a bunch!


All right. Then I suggest starting with the basic example using the arithmetic mean, based on the query expression I’ve shared above, in order to check that it will not kill your browser or other upstream systems ;]. Assuming it will make you happy instantly, which I dearly hope, you can level up from there gradually.