Partition requires significantly a lot more space than the others

Good morning!

Currently I`m ingesting 11Mio. rows of IoT timeseries data through Kafka into cratedb. I use a single table with a rather flat schema for it, which currently hold ~70Mio rows for the last 3 years. This large table is partitioned over multiple fields (Type of device, Year, Source) so that a partition has a couple of Millions of rows at worst, most have only some hundrets of thousands. So far that worked for me quite acceptable, but I am still struggling to tune this correctly.

At this specific Ingestion process, I have two major issues:

A) Ingestion performance is very poor… only about 100 Rows per second at best. Batches of about 1000 Rows are sent to the HTTP Endpoint via a Bulk-Insert. The table has only 1 replica (as there is all data basically in there so I don’t want to lose anything on node failure). Every partition is distributed over 6 shards. I have 5 Nodes runnning, each has 4 CPUs and 8GB RAM (which isn’t much at all, I know). They are running as Kubernetes Pods in AKS and persisting to Azurefile Volumes (Azure Storage Account) which I should also probably switch to Premium-SSD Disks instead. As this whole thing is just a Proof-of-concept system not productive at all), my question is just, if I do anything fatally wrong on the database side or is the ingestion performance just so bad because of the Hardware it is running on.

B) The data I am ingesting gets located into its own partition. The data in the different partitions is totally compareable in structure and size, so it should basically need the same amount of space. But this isn’t the case…the partition needs about 20-times the space for the same amount of rows as data in a comparable partition. I don’t have a clue why this is the case…any suggestions on this?

1 Like

This looks like you setup the table with 4 replicas and 6 shards per partition with primaries, this would lead to 30 shards per partition (and poor performance)

-Is this really intended (having 5 copies of all shards)?

  • Could you share your table schema? (at least the partition columns) and CLUSTERED / PARTITIONED BY properties)

Be aware, that CrateDB temporarily might need more storage during ingest, but this can significantly drop after some merges and cleanups happened.

Oh my …well I wasn’t aware that each partition has its own replica setting…no that wasn’t intended (at least for ingestion…I am not quite sure if it is improving query performance)

Thats the DDL of that table

CREATE TABLE odp_timeseries (
     entity_id text, -- primary key,
     entity_type text,
     time_index timestamp with time zone,
     source text, -- not null default 'DefaultNone',
     value text,
     value_num double precision,
     category text,
     location_index text,
     location geo_shape,
     location_centroid geo_point,
     parent text,
     content object(ignored),
     year timestamp with time zone GENERATED ALWAYS AS extract(year from time_index),
     hod integer GENERATED ALWAYS AS extract(hour from time_index)
) CLUSTERED BY (hod) INTO 6 SHARDS PARTITIONED BY (year, entity_type, source)

“content” holds an JSON representation of that row data which might be the largest data asset here, but the JSON is also quite short and compareable to the JSONs in other partitions.

“location” are UBER H3 Tiles, so there are only Hexagons (6+1 Coordinate Polygon) stored in each row…nothing quite big

And you are right…Size of the partition is dropping slowly after ingest…right now it has dropped to 13,4GB (hopefully it will drop even further)

Just for a general reference oversharding / overpartitioning is considered bad practice. Typically a single shard can easily hold between 10 to 50 GiBs of data without any significant performance impacts (also see Sharding and Partitioning Guide for Time Series Data)

→ a partition of 6 shards is fine to hold ~300 GiBs of data.

Also specifying CLUSTERED BY(hod) probably prevents data to be probably distributed across shards. If less shards are used, probably also the performance is worse.

I would suggest to do the following:


10 shards, so that each node ideally gets assigned 2 and partition only by year

1 Like

BTW you probably want to use an integer data type for year or use date_trunc() instead of extract()


year integer GENERATED ALWAYS AS extract(year from time_index),


year timestamp with time zone GENERATED ALWAYS AS date_trunc('year', time_index),

Thank you very much for those insights! Yes, a lot of those settings are only the result of me experimenting with it…the “hod” is one of them I wasn’t sure if that is very clever…I’ll abandon this.

Is there a way to set the replicas for all partitions?
Do they just inherit from the table on creation of a partition?
Do they also inherit shards from the table definition? Wouldn’t 10 shards per partition be way too much then as my largest partition holds 4.4GB (except the new one which is 14GB)?

Is there a way to set the replicas for all partitions?

yes, on table creation or with ALTER TABLE parted_table SET (number_of_replicas ='1-2');

Do they just inherit from the table on creation of a partition?

On the creation of a partition the currently set value of the table is used. However each partition can have a different replication level. E.g. older (not so important) partitions could have less or no replicas.

Do they also inherit shards from the table definition? Wouldn’t 10 shards per partition be way too much then as my largest partition holds 4.4GB (except the new one which is 14GB)?

yes, but this can also be changed. There is no definitive answer. If your indexing to be distributed across the cluster it is best to have a multiple of the number of nodes for your count of shards. AS mentioned above a different sharding strategy (e.g. only by year) would also help

1 Like

ok…the ALTER TABLE …changed the replica setting of my parted_table, but left the value on the partitions untouched… guess I will have to change it manually on each partition.

Thank you very much… A lot to do for me based on this :wink:

ALTER TABLE tab_name alters all partitions new and old
ALTER TABLE ONLY tab_name alters only new partitions
ALTER TABLE tab_name PARTITION partition_column = value alters a specific partition

1 Like

one last question to this specific topic: the table is often queried like where entity_type = ... or where source = ... which is the major reason why these two fields are used in the partitioning scheme.

If I partition only by year, is there any possibility to improve query performance for these cases where you would make use of an index in classic databases?

If not explicitly turned off every field is indexed anyway (inverted index for text fields, kdb-trees for numeric values).