SELECT with Join takes forever

fox_pluto · November 14, 2021, 9:37am

Hi,

I have just started using CrateDB and I have created two table, the first one with 3,3 Millions of records (Events table) and the second one with 60k records (Stats table).
These are the tables:

create table events (
	data_block object(dynamic)as (
		"eventTimestamp" timestamp with time zone
	)
) clustered into 9 shards
with (number_of_replicas = 1);

create table stats (
	metric object(dynamic) as (
		"eventTimestamp" timestamp with time zone
	)
) clustered into 9 shards
with (number_of_replicas = 1);

I need to use a SELECT JOIN

select "doc"."events"."data_block['eventId']" from 
"doc"."events" left outer join "doc"."stats" on "doc"."events"."data_block['eventId']" = "doc"."stats"."metric['eventId']" 
where "doc"."stats"."metric['eventId']" is null 
order by "doc"."events"."data_block['eventId']" desc limit 10

The query takes forever to be answered.
I did something wrong ?
In mysql is blazing fast.

Regards,
S.

proddata · November 14, 2021, 1:44pm

From the query it is not quite clear what you really want to do !?

where "doc"."stats"."metric['eventId']" is null

should this be is not null?

ORDER BY is a pipeline breaker, i.e. the result has to be materialised before applying the LIMIT 10

LEFT OUTER JOIN leads to a Nested Loop Join(Joins — CrateDB: Reference) which performs much worse, then a Hash Join.

Be aware that CrateDB is not a classical relational database and therfore a denormalized schema can significantly perform better in many cases.

Query Plan

Eval[data_block['eventId']]
  └ Limit[10::bigint;0]
    └ OrderBy[data_block['eventId'] DESC]
      └ Filter[(metric['eventId'] IS NULL)]
        └ NestedLoopJoin[LEFT | (data_block['eventId'] = metric['eventId'])]
          ├ Collect[doc.events | [data_block['eventId']] | true]
          └ Collect[doc.stats | [metric['eventId']] | true]

what seems a bit strange, is that the filter is not propogated down, to the Collect Phase

fox_pluto · November 14, 2021, 10:12pm

Hi,
the query search for the eventsId which are not present in the metric table but are in the events table, that’s the reason I have the:

where "doc"."stats"."metric['eventId']" is null

and not the is not null.

I got the point about the ORDER BY you are right without that is much faster.
May I ask you what do you mean with a denormalized schema?
Thanks

proddata · November 15, 2021, 2:11pm

A schema that would combine both tables

the query search for the eventsId which are not present in the metric table but are in the events table

then maybe such a query could be an alternative?

SELECT events.data_block['eventId'] FROM events
WHERE events.data_block['eventId'] NOT IN (SELECT stats.metric['eventId'] FROM stats)
LIMIT 1000;

fox_pluto · November 15, 2021, 4:39pm

WOW,
thanks!
this works quite fast and do exactly what I was looking for.

Thanks for your help,
regards.
S.

Topic		Replies	Views
CrateDB LEFT JOIN performance SQL	0	912	May 23, 2019
Group by query with join taking forever SQL	14	1080	December 21, 2021
Poor performance with five tables using join CrateDB sql	7	439	December 19, 2023
CrateDB \| Performance Optimisation in Left Joins SQL	1	672	July 20, 2021
IllegalStateException with left join query CrateDB	8	609	March 3, 2022

SELECT with Join takes forever

Related topics