SELECT with Join takes forever

Hi,

I have just started using CrateDB and I have created two table, the first one with 3,3 Millions of records (Events table) and the second one with 60k records (Stats table).
These are the tables:

create table events (
	data_block object(dynamic)as (
		"eventTimestamp" timestamp with time zone
	)
) clustered into 9 shards
with (number_of_replicas = 1);
create table stats (
	metric object(dynamic) as (
		"eventTimestamp" timestamp with time zone
	)
) clustered into 9 shards
with (number_of_replicas = 1);

I need to use a SELECT JOIN

select "doc"."events"."data_block['eventId']" from 
"doc"."events" left outer join "doc"."stats" on "doc"."events"."data_block['eventId']" = "doc"."stats"."metric['eventId']" 
where "doc"."stats"."metric['eventId']" is null 
order by "doc"."events"."data_block['eventId']" desc limit 10

The query takes forever to be answered.
I did something wrong ?
In mysql is blazing fast.

Regards,
S.

From the query it is not quite clear what you really want to do !?

where "doc"."stats"."metric['eventId']" is null 

should this be is not null?


ORDER BY is a pipeline breaker, i.e. the result has to be materialised before applying the LIMIT 10

LEFT OUTER JOIN leads to a Nested Loop Join(Joins β€” CrateDB: Reference) which performs much worse, then a Hash Join.


Be aware that CrateDB is not a classical relational database and therfore a denormalized schema can significantly perform better in many cases.

Query Plan

Eval[data_block['eventId']]
  β”” Limit[10::bigint;0]
    β”” OrderBy[data_block['eventId'] DESC]
      β”” Filter[(metric['eventId'] IS NULL)]
        β”” NestedLoopJoin[LEFT | (data_block['eventId'] = metric['eventId'])]
          β”œ Collect[doc.events | [data_block['eventId']] | true]
          β”” Collect[doc.stats | [metric['eventId']] | true]

what seems a bit strange, is that the filter is not propogated down, to the Collect Phase

1 Like

Hi,
the query search for the eventsId which are not present in the metric table but are in the events table, that’s the reason I have the:

where "doc"."stats"."metric['eventId']" is null

and not the is not null.

I got the point about the ORDER BY you are right without that is much faster.
May I ask you what do you mean with a denormalized schema?
Thanks

A schema that would combine both tables

the query search for the eventsId which are not present in the metric table but are in the events table

then maybe such a query could be an alternative?

SELECT events.data_block['eventId'] FROM events
WHERE events.data_block['eventId'] NOT IN (SELECT stats.metric['eventId'] FROM stats)
LIMIT 1000;
2 Likes

WOW,
thanks!
this works quite fast and do exactly what I was looking for.

Thanks for your help,
regards.
S.

1 Like