can you help me understand what I am reading in this documentation?https://crate.io/docs/reference/sql/partitioned_tables.html
In these example tables, column id long
is not a primary_key
; indeed, id
could not be a primary key here, because as noted below “if a primary key is set, it must be present in the PARTITION BY
clause”
In my app, I’ve historically had a primary key
on id string NOT NULL
, but now I want to add partitioning on this table, on a generated date column just like in the example partition_date timestamp GENERATED ALWAYS AS date_trunc('day', created_at)
. I’ve read that partitioning on a date column will help with speed of queries scoped by time period (counting all today’s records, for example, would only hit today’s partitions), and helping me archive older frames of data (e.g. anything > 180 days), but I don’t want to lose performance of single PK lookups.
So since I can’t do just PARTITIONED BY (partition_date)
, is it best if I…
a) Remove the primary key constraint from id
? I’m nervous this would affect my performance for single row lookups! In this context, it makes sense that PK must be in the partition key, because to lookup WHERE id = "abc-123"
should ideally only have to hit a single node.
or
b) use both columns as partition key, like PARTITIONED BY (id, partition_date)
– This seems weird, because instinctively, I want to assume that id
would have high cardinality and be a bad choice for a partition column, and ‘day’ or ‘month’ would be better, like is shown in the example on your docs. In this case, is my PK lookup hitting every partition, or does it know exactly where to go? If I run an aggregate query scoped to today only, will it hit every partition or only the one holding today’s data?