Hi,
I have just started using CrateDB and I have created two table, the first one with 3,3 Millions of records (Events table) and the second one with 60k records (Stats table).
These are the tables:
create table events (
data_block object(dynamic)as (
"eventTimestamp" timestamp with time zone
)
) clustered into 9 shards
with (number_of_replicas = 1);
create table stats (
metric object(dynamic) as (
"eventTimestamp" timestamp with time zone
)
) clustered into 9 shards
with (number_of_replicas = 1);
I need to use a SELECT JOIN
select "doc"."events"."data_block['eventId']" from
"doc"."events" left outer join "doc"."stats" on "doc"."events"."data_block['eventId']" = "doc"."stats"."metric['eventId']"
where "doc"."stats"."metric['eventId']" is null
order by "doc"."events"."data_block['eventId']" desc limit 10
The query takes forever to be answered.
I did something wrong ?
In mysql is blazing fast.
Regards,
S.
From the query it is not quite clear what you really want to do !?
where "doc"."stats"."metric['eventId']" is null
should this be is not null
?
ORDER BY
is a pipeline breaker, i.e. the result has to be materialised before applying the LIMIT 10
LEFT OUTER JOIN
leads to a Nested Loop Join(Joins β CrateDB: Reference) which performs much worse, then a Hash Join.
Be aware that CrateDB is not a classical relational database and therfore a denormalized schema can significantly perform better in many cases.
Query Plan
Eval[data_block['eventId']]
β Limit[10::bigint;0]
β OrderBy[data_block['eventId'] DESC]
β Filter[(metric['eventId'] IS NULL)]
β NestedLoopJoin[LEFT | (data_block['eventId'] = metric['eventId'])]
β Collect[doc.events | [data_block['eventId']] | true]
β Collect[doc.stats | [metric['eventId']] | true]
what seems a bit strange, is that the filter is not propogated down, to the Collect Phase
1 Like
Hi,
the query search for the eventsId which are not present in the metric table but are in the events table, thatβs the reason I have the:
where "doc"."stats"."metric['eventId']" is null
and not the is not null
.
I got the point about the ORDER BY
you are right without that is much faster.
May I ask you what do you mean with a denormalized schema
?
Thanks
A schema that would combine both tables
the query search for the eventsId which are not present in the metric table but are in the events table
then maybe such a query could be an alternative?
SELECT events.data_block['eventId'] FROM events
WHERE events.data_block['eventId'] NOT IN (SELECT stats.metric['eventId'] FROM stats)
LIMIT 1000;
2 Likes
WOW,
thanks!
this works quite fast and do exactly what I was looking for.
Thanks for your help,
regards.
S.
1 Like