I’ve got a query that shows new users vs. returning users. It ran quickly during development (under 500 ms), but once we hit production levels of data the execution time spiraled out of control: it now averages 20 seconds for only one month’s worth of data.
Interestingly, even if I omit the UNION (which I use to compare against the prior time period), it still takes about 19 seconds. The query plan shows it is using hash joins, which should be fast.
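If it’s useful, the join stage can be profiled on its own with something like the following (same filters as the first branch, grouping stripped), to see whether the hash join itself or the aggregation dominates — I haven’t isolated it this way yet:

```sql
-- Profile only the join stage of the first branch (no grouping),
-- to separate HashJoin cost from the GROUP BY / aggregate cost.
EXPLAIN ANALYZE
SELECT count(*)
FROM af_dev.pageview
JOIN af_dev.user_session
  ON af_dev.pageview.user_id = af_dev.user_session.user_id
WHERE af_dev.user_session.domain = 'domain.com'
  AND af_dev.user_session.hostname = 'www.domain.com'
  AND af_dev.user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
  AND af_dev.pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
  AND af_dev.pageview.domain = 'domain.com'
  AND af_dev.pageview.host = 'www.domain.com';
```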
SQL Query:
SELECT
min(visits.event_time) AS date,
coalesce(sum(visits.sess_count) FILTER (
WHERE
visits.sess_count = 1
), 0) AS new_vis,
coalesce(avg(visits.sess_len) FILTER (
WHERE
visits.sess_count = 1
), 0) AS new_avg,
coalesce(sum(visits.sess_count) FILTER (
WHERE
visits.sess_count > 1
), 0) AS ret_vis,
coalesce(avg(visits.sess_len) FILTER (
WHERE
visits.sess_count > 1
), 0) AS ret_avg,
coalesce(sum(visits.sess_count), 0) AS tot_vis,
coalesce(avg(visits.sess_len), 0) AS tot_avg
FROM
(
SELECT
max(joined_vis.totaltime) AS sess_len,
count(joined_vis.session_id) AS sess_count,
min(joined_vis.event_time) AS event_time
FROM
(
SELECT
af_dev.pageview.totaltime AS totaltime,
af_dev.user_session.user_id AS user_id,
af_dev.user_session.session_id AS session_id,
af_dev.pageview.event_time AS event_time,
af_dev.user_session.domain AS domain,
af_dev.user_session.hostname AS hostname
FROM
af_dev.pageview
JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
WHERE
af_dev.user_session.domain = 'domain.com'
AND af_dev.user_session.hostname = 'www.domain.com'
AND af_dev.user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
AND af_dev.pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
AND af_dev.pageview.domain = 'domain.com'
AND af_dev.pageview.host = 'www.domain.com'
) AS joined_vis
GROUP BY
joined_vis.user_id,
joined_vis.session_id
) AS visits
UNION
SELECT
min(p_visits.event_time) AS date,
sum(p_visits.sess_count) FILTER (
WHERE
p_visits.sess_count = 1
) AS new_vis,
avg(p_visits.sess_len) FILTER (
WHERE
p_visits.sess_count = 1
) AS new_avg,
sum(p_visits.sess_count) FILTER (
WHERE
p_visits.sess_count > 1
) AS ret_vis,
avg(p_visits.sess_len) FILTER (
WHERE
p_visits.sess_count > 1
) AS ret_avg,
sum(p_visits.sess_count) AS tot_vis,
avg(p_visits.sess_len) AS tot_avg
FROM
(
SELECT
max(joined_vis.totaltime) AS sess_len,
count(joined_vis.session_id) AS sess_count,
min(joined_vis.event_time) AS event_time
FROM
(
SELECT
af_dev.pageview.totaltime AS totaltime,
af_dev.user_session.user_id AS user_id,
af_dev.user_session.session_id AS session_id,
af_dev.pageview.event_time AS event_time,
af_dev.user_session.domain AS domain,
af_dev.user_session.hostname AS hostname
FROM
af_dev.pageview
JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
WHERE
af_dev.user_session.domain = 'domain.com'
AND af_dev.user_session.hostname = 'www.domain.com'
AND af_dev.user_session.event_time BETWEEN '2022-11-05' AND '2022-12-05'
AND af_dev.pageview.event_time BETWEEN '2022-11-05' AND '2022-12-05'
AND af_dev.pageview.domain = 'domain.com'
AND af_dev.pageview.host = 'www.domain.com'
) AS joined_vis
GROUP BY
joined_vis.user_id,
joined_vis.session_id
) AS p_visits;
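For reference, one rewrite I’m considering (untested) scans both months in a single pass and tags each session with its period, instead of running the join twice and UNION-ing the results. It assumes a session never straddles the 2022-12-05 boundary, and it returns the two periods as two rows rather than one:

```sql
-- Sketch: single scan over both months; each session falls into the
-- 'current' or 'prior' period based on its first event_time.
SELECT
  CASE WHEN visits.event_time >= '2022-12-05'
       THEN 'current' ELSE 'prior' END AS period,
  min(visits.event_time) AS date,
  coalesce(sum(visits.sess_count) FILTER (WHERE visits.sess_count = 1), 0) AS new_vis,
  coalesce(avg(visits.sess_len)   FILTER (WHERE visits.sess_count = 1), 0) AS new_avg,
  coalesce(sum(visits.sess_count) FILTER (WHERE visits.sess_count > 1), 0) AS ret_vis,
  coalesce(avg(visits.sess_len)   FILTER (WHERE visits.sess_count > 1), 0) AS ret_avg,
  coalesce(sum(visits.sess_count), 0) AS tot_vis,
  coalesce(avg(visits.sess_len), 0)   AS tot_avg
FROM (
  SELECT
    max(pv.totaltime)    AS sess_len,
    count(us.session_id) AS sess_count,
    min(pv.event_time)   AS event_time
  FROM af_dev.pageview pv
  JOIN af_dev.user_session us ON pv.user_id = us.user_id
  WHERE us.domain = 'domain.com'
    AND us.hostname = 'www.domain.com'
    AND us.event_time BETWEEN '2022-11-05' AND '2023-01-05'
    AND pv.event_time BETWEEN '2022-11-05' AND '2023-01-05'
    AND pv.domain = 'domain.com'
    AND pv.host = 'www.domain.com'
  GROUP BY us.user_id, us.session_id
) AS visits
GROUP BY
  CASE WHEN visits.event_time >= '2022-12-05'
       THEN 'current' ELSE 'prior' END;
```

That would halve the number of joins, but I don’t know whether the join or the aggregation is the bottleneck, hence the question.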
SQL Explain:
GroupHashAggregate[date, new_vis, new_avg, ret_vis, ret_avg, tot_vis, tot_avg]
└ Union[date, ret_vis, ret_avg, new_vis, tot_vis, tot_avg, new_avg]
├ Eval[min(event_time) AS date, coalesce(sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), 0::bigint) AS ret_vis, coalesce(avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), 0.0) AS ret_avg, coalesce(sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), 0::bigint) AS new_vis, coalesce(sum(sess_count), 0::bigint) AS tot_vis, coalesce(avg(sess_len), 0.0) AS tot_avg, coalesce(avg(sess_len) FILTER (WHERE (sess_count = 1::bigint)), 0.0) AS new_avg]
│ └ HashAggregate[min(event_time), sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), sum(sess_count), avg(sess_len), avg(sess_len) FILTER (WHERE (sess_count = 1::bigint))]
│ └ Rename[event_time, sess_count, sess_len] AS visits
│ └ Eval[min(event_time) AS event_time, count(session_id) AS sess_count, max(totaltime) AS sess_len]
│ └ GroupHashAggregate[user_id, session_id | min(event_time), count(session_id), max(totaltime)]
│ └ Rename[session_id, user_id, event_time, totaltime] AS joined_vis
│ └ Eval[session_id AS session_id, user_id AS user_id, event_time AS event_time, totaltime AS totaltime]
│ └ HashJoin[(user_id = user_id)]
│ ├ Collect[af_dev.pageview | [totaltime, event_time, user_id] | ((((event_time >= 1670198400000::bigint) AND (event_time <= 1672876800000::bigint)) AND (host = 'www.domain.com')) AND (domain = 'domain.com'))]
│                       └ Collect[af_dev.user_session | [user_id, session_id] | ((((domain = 'domain.com') AND (hostname = 'www.domain.com')) AND (event_time >= 1670198400000::bigint)) AND (event_time <= 1672876800000::bigint))]
└ Eval[min(event_time) AS date, sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)) AS ret_vis, avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)) AS ret_avg, sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)) AS new_vis, sum(sess_count) AS tot_vis, avg(sess_len) AS tot_avg, avg(sess_len) FILTER (WHERE (sess_count = 1::bigint)) AS new_avg]
└ HashAggregate[min(event_time), sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), sum(sess_count), avg(sess_len), avg(sess_len) FILTER (WHERE (sess_count = 1::bigint))]
└ Rename[event_time, sess_count, sess_len] AS p_visits
└ Eval[min(event_time) AS event_time, count(session_id) AS sess_count, max(totaltime) AS sess_len]
└ GroupHashAggregate[user_id, session_id | min(event_time), count(session_id), max(totaltime)]
└ Rename[session_id, user_id, event_time, totaltime] AS joined_vis
└ Eval[session_id AS session_id, user_id AS user_id, event_time AS event_time, totaltime AS totaltime]
└ HashJoin[(user_id = user_id)]
├ Collect[af_dev.pageview | [totaltime, event_time, user_id] | ((((event_time >= 1667606400000::bigint) AND (event_time <= 1670198400000::bigint)) AND (host = 'www.domain.com')) AND (domain = 'domain.com'))]
└ Collect[af_dev.user_session | [user_id, session_id] | ((((domain = 'domain.com') AND (hostname = 'www.domain.com')) AND (event_time >= 1667606400000::bigint)) AND (event_time <= 1670198400000::bigint))]
Table sizes:
af_dev.pageview 5.7 M records, 1 GB
af_dev.user_session 35,000 records, 16 MB
Crate version: 5.1.2