Query using MAX on column using calculated partition key on that column not performing as expected

I have a table with billions of records, that looks something like:

CREATE TABLE IF NOT EXISTS "public"."locations" (
   "last_response" TIMESTAMP WITHOUT TIME ZONE NOT NULL,
   "id" BIGINT DEFAULT CAST(current_timestamp AS bigint) NOT NULL,
   
   --details elided
   
   "part_hour" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('hour', "last_response"),
   "part_day" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('day', "last_response"),
   "part_week" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('week', "last_response") NOT NULL,
   "part_month" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS date_trunc('month', "last_response"),
   PRIMARY KEY ("id", "part_week")
)
CLUSTERED INTO 6 SHARDS
PARTITIONED BY ("part_week")

This query

SELECT MAX(last_response) from public.locations WHERE last_response >= '2021-01-23 00:00:00.000000' AND last_response < '2022-01-01 00:00:00.000000';

The EXPLAIN is:


HashAggregate[max(last_response)] (rows=1)
  + Collect[public.locations | [last_response] | ((last_response >= 1611360000000::bigint) AND (last_response < 1640995200000::bigint))] (rows=0)

and takes a long time (more than a minute) to complete.

As the table is partitioned on part_week, which is in turn based on last_response, I would have expected the logic to be “identify shard with highest part_week, return max(last_response) from that”, and it should return nearly instantly.

Where is my understanding flawed please?

With many thanks.

Hi,
Could you try adding this to your WHERE clause?

AND last_response>= (SELECT MAX(a.part_week) FROM public.locations a)

It took half as long, and interestingly returned NULL result.

The explain is

MultiPhase (rows=1)
  └ HashAggregate[max(last_response)] (rows=1)
    └ Collect[public.locations | [last_response] | (((last_response >= 1611360000000::bigint) AND (last_response < 1640995200000::bigint)) AND (last_response >= (SELECT max(part_week) FROM (a))))] (rows=0)
  └ Limit[2::bigint;0::bigint] (rows=1)
    └ HashAggregate[max(part_week)] (rows=1)
      └ Rename[part_week] AS a (rows=7845405573)
        └ Collect[public.locations | [part_week AS date_trunc('week', last_response)] | true] (rows=7845405573)

Sorry, my mistake, try

AND last_response>= (
   SELECT MAX(a.part_week) 
   FROM public.locations a
   WHERE a.part_week <=  date_trunc('week','2022-01-01 00:00:00.000000')
   )

Thanks very much, but no change.

MultiPhase (rows=1)
  └ Limit[100::bigint;0] (rows=1)
    └ HashAggregate[max(last_response)] (rows=1)
      └ Collect[public.locations | [last_response] | (((last_response >= 1611360000000::bigint) AND (last_response < 1640995200000::bigint)) AND (last_response >= (SELECT max(part_week) FROM (a))))] (rows=0)
  └ Limit[2::bigint;0::bigint] (rows=1)
    └ HashAggregate[max(part_week)] (rows=1)
      └ Rename[part_week] AS a (rows=7079955502)
        └ Collect[public.locations | [part_week AS date_trunc('week', last_response)] | (part_week <= 1640563200000::bigint)] (rows=7079955502)

What about this?

AND last_response>= (
   SELECT MAX(values['part_week']) 
   FROM information_schema.table_partitions 
   WHERE values['part_week'] <=  date_trunc('week','2022-01-01 00:00:00.000000')
      AND table_name='locations' 
      AND table_schema='public'
   )

That returned a value, but no quicker than without the new clause.

MultiPhase (rows=1)
  └ Limit[100::bigint;0] (rows=1)
    └ HashAggregate[max(last_response)] (rows=1)
      └ Collect[public.locations | [last_response] | (((last_response >= 1611360000000::bigint) AND (last_response < 1640995200000::bigint)) AND (last_response >= (SELECT max(values['part_week']) FROM (information_schema.table_partitions))))] (rows=0)
  └ Limit[2::bigint;0::bigint] (rows=1)
    └ HashAggregate[max(values['part_week'])] (rows=1)
      └ Collect[information_schema.table_partitions | [values['part_week']] | (((values['part_week'] <= 1640563200000::bigint) AND (table_name = 'locations')) AND (table_schema = 'public'))] (rows=unknown)

Maybe do one last test changing last_response>= ( with part_week = (
and if that does not help please share the output of EXPLAIN ANALYZE

analyze.txt (1.3 MB)

Please find attached. Many thanks

Hi there.

Thanks so much for your help thus far.

Appreciate that the ANALYZE output is super verbose. But I suspect the bottom line (with my inexperienced eye) is that every partition is indeed being considered.

I wonder if you could advise if my underlying assumption (that the partition key being one that is calculated from a given column, means that queries on that original column should in theory eliminate partitions) seems correct on the face of it?

Or would you always expect additional clauses to be needed if its a calculated column?

(I’m trying to more or less lift and shift existing postgres app, so changing queries at all is undesirable)

Thanks again.

Hi,
The 3 execution plans you shared show indeed multiple PartitionIdent being considered.
However on my repro environment on CrateDB 5.9.4 , the query goes straight to the right partition, even without introducing the part_week = clause .

would you always expect additional clauses to be needed if its a calculated column?

No, actually the additional clause should not be needed, see for instance the example on this GitHub ticket closed a number of months ago.

I understand you may be on 5.9 but maybe not on 5.9.4?

I see your original execution plan has (rows=0) but then you were surprised to receive a NULL result, this may point to stale statistics or left-behind partitions following the deletion of some records. Do you have statistics collection disabled by any chance?

Thanks for coming back to me.

Upgraded from 5.9.3 to 5.9.4, but no change.

My config is pretty much default. So it was the default stats values, including enabled. I set it to collect stats every 10 mins just in case, but no change. Is there a more definitive way to be very sure the stats is indeed collected?

I have never deleted any rows from this table.

Experimenting with reducing the range in the query - so that it’s looking at a narrow range of last_response - the query time come down seemingly linearly, and the ANALYZE shows that it is correctly only considering partitions in that range.

But each of the MAX queries on each of those partitions is taking a very long time - about a second. The partitions are largish - 50 gig - with 212m records say. Is that duration expected? Or should the indexing of this TIMESTAMP WITHOUT TIME ZONE field be able to return the max immediately? If so, then that is perhaps the root problem here?

Thanks again.

Appreciate if anyone is able to comment on my last query. It feels like something is broken regarding indexing - I would presume that queries like MAX(X) where X is a timestamp should be immediately satisfied for a shard in almost no time. Is that in fact not the case?

Are there any tips as to how I can proceed with diagnosing the root cause? I’m very reluctant to give up on Crate.

Thanks very much.

Hi,

This is not currently the case, what you could maybe do here is try a different number of shards to see if a different level of parallelism may help.
Also you will probably see that if you narrow the search space further within the partition identifying the max is much faster, so with some additional logic, if you know the general area where the last_response may be, with some trial and error/binary search kind of strategy you may get results quicker.