Handling duplicate records

Prevent duplicates

Your first option is to prevent it from happening in the first place by defining a PRIMARY KEY for your tables. If you try to insert data with the same PRIMARY KEY a DuplicateKeyException is thrown.

Example:

CREATE TABLE test ( 
  machine_id TEXT, 
  ts TIMESTAMP, 
  payload OBJECT, 
  PRIMARY KEY (machine_id, ts)
);
INSERT INTO test ( machine_id, ts, payload ) VALUES 
  ('machine01', '2021-05-10T08:00', '{"value": 1}');

If you try to insert the same record twice CrateDB prevents it and throws a DuplicateKeyException.

Remove duplicates

Given you have inserted duplicate records already and want to remove them. An easy query for this could be:

DELETE FROM test_duplicates WHERE _id IN (
  SELECT arbitrary(_id) AS ids FROM test_duplicates 
    GROUP BY machine_id, ts HAVING COUNT(*) > 1);

Group your records by the column(s) which identify your duplicates. Repeat the query multiple times if you inserted the duplicates multiple times.

Please note that this query deletes a random duplicate which could (depending on your data model) not be what you want (e.g. records are joined with another table) or when your duplicates are not true duplicates (e.g. inserted some duplicates with corrupted data).

Example table and insert to try the query for yourself:

CREATE TABLE test_duplicates ( 
  machine_id TEXT, 
  ts TIMESTAMP, 
  payload OBJECT
);
INSERT INTO test_duplicates (ts, machine_id, payload) VALUES 
  ('2020-05-06T08:00', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:00', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:01', 'm01', '{"value": 1}'),
  ('2020-05-06T08:02', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:02', 'm01', '{"value": 1}'),
  ('2020-05-06T08:00', 'm02', '{"value": 1}'), 
  ('2020-05-06T08:00', 'm02', '{"value": 1}'),
  ('2020-05-06T08:00', 'm02', '{"value": 1}');

Deterministically remove duplicates

Caveat: For this to work, you need to be able to order your records. Thus, this does not work for exact duplicates.

DELETE FROM test_duplicates WHERE _id IN (
  SELECT first_id FROM (
    SELECT first_id, COUNT(*) AS cnt FROM (
      SELECT FIRST_VALUE(_id) 
        OVER (PARTITION BY ts, machine_id ORDER BY ts) AS first_id 
      FROM "test_duplicates") tbl 
    GROUP BY first_id) tbl1 
    WHERE cnt > 1);
3 Likes