Handling duplicate records

jayeff · May 10, 2021, 8:10am

Prevent duplicates

Your first option is to prevent it from happening in the first place by defining a PRIMARY KEY for your tables. If you try to insert data with the same PRIMARY KEY a DuplicateKeyException is thrown.

Example:

CREATE TABLE test ( 
  machine_id TEXT, 
  ts TIMESTAMP, 
  payload OBJECT, 
  PRIMARY KEY (machine_id, ts)
);

INSERT INTO test ( machine_id, ts, payload ) VALUES 
  ('machine01', '2021-05-10T08:00', '{"value": 1}');

If you try to insert the same record twice CrateDB prevents it and throws a DuplicateKeyException.

Remove duplicates

Given you have inserted duplicate records already and want to remove them. An easy query for this could be:

DELETE FROM test_duplicates WHERE _id IN (
  SELECT arbitrary(_id) AS ids FROM test_duplicates 
    GROUP BY machine_id, ts HAVING COUNT(*) > 1);

Group your records by the column(s) which identify your duplicates. Repeat the query multiple times if you inserted the duplicates multiple times.

Please note that this query deletes a random duplicate which could (depending on your data model) not be what you want (e.g. records are joined with another table) or when your duplicates are not true duplicates (e.g. inserted some duplicates with corrupted data).

Example table and insert to try the query for yourself:

CREATE TABLE test_duplicates ( 
  machine_id TEXT, 
  ts TIMESTAMP, 
  payload OBJECT
);

INSERT INTO test_duplicates (ts, machine_id, payload) VALUES 
  ('2020-05-06T08:00', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:00', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:01', 'm01', '{"value": 1}'),
  ('2020-05-06T08:02', 'm01', '{"value": 1}'), 
  ('2020-05-06T08:02', 'm01', '{"value": 1}'),
  ('2020-05-06T08:00', 'm02', '{"value": 1}'), 
  ('2020-05-06T08:00', 'm02', '{"value": 1}'),
  ('2020-05-06T08:00', 'm02', '{"value": 1}');

Deterministically remove duplicates

Caveat: For this to work, you need to be able to order your records. Thus, this does not work for exact duplicates.

DELETE FROM test_duplicates WHERE _id IN (
  SELECT first_id FROM (
    SELECT first_id, COUNT(*) AS cnt FROM (
      SELECT FIRST_VALUE(_id) 
        OVER (PARTITION BY ts, machine_id ORDER BY ts) AS first_id 
      FROM "test_duplicates") tbl 
    GROUP BY first_id) tbl1 
    WHERE cnt > 1);

Topic		Replies	Views
Unable to trigger duplicate key error with multi-record insert CrateDB	2	37	July 21, 2024
Prepared Statement Query Returns Duplicate Records but CrateDB Console Query Doesn’t CrateDB sql	1	36	January 23, 2025
ERROR: A document with the same primary key exists already (SQLSTATE 23505) CrateDB prometheus	2	52	August 1, 2024
Retrieving records in bulk with a list of primary key values Tutorials sql	0	564	February 23, 2024
Designing schema to avoid having to use REFRESH TABLE CrateDB	1	655	June 8, 2020

Handling duplicate records

Prevent duplicates

Remove duplicates

Deterministically remove duplicates

Related topics