Rows need significantly more storage as others in the same table and pretty much the same content

I already mentioned this in another Topic, then thougth it is solved, but still the issue is emminent and I tried to better circumscribe on this.

I have a table with the following scheme:

CREATE TABLE doc.odp_ts (
     entity_id text, -- primary key,
     entity_type text,
     time_index timestamp with time zone,
     source text,
     value text,
     value_num double precision,
     category text,
     location_index text,
     location geo_shape,
     location_centroid geo_point,
     parent text,
     content object(ignored),
     year integer GENERATED ALWAYS AS extract(year from coalesce(time_index,CURRENT_TIMESTAMP))

In location there is a polygon representation of a UBER H3 Hexagon, so always only 7 vertices.

In content there is a JSON representation of the data of this row. Its a little bit more comprehensive, but EVERY (!) rows json is nearly the same size. There isn’t much variation in the size of these JSONs

In source there is either ‘aaa’ or ‘bbb’…you guess it? It tells how the rows were imported:

  • The ‘aaa’ were imported via COPY FROM... and then through some ELT Queries cleaned and INSERTED into this table
  • The ‘bbb’ were INSERTED directly via the HTTP Endpoint`s bulk_copy

If I delete all ‘bbb’ rows and keep only the roughly 8.8 Mio. ‘aaa’ rows, the table has a size of 1.5GB
If I delete all ‘aaa’ rows and keep only roughly 9 Mio. ‘bbb’ rows, the table has a size between 30 and 50 GB (I got different outcomes in my experiements).

These sizes had a long time to sit in, so I doubt there are any table optimizations, merges and so on left to bring that further down. The problem is not only size, but also Query time if the result set intersects on any of these heavy-weight rows.

I am assuming that this has something to do with how the index was built, as the index would be the only thing I could’t compare so far (the rows totally hold compareable data) and this would also explain why queries are significantly slower on these rows.

Is there any possibility to check/repair/recreate the index? Just copying the rows to a new table didn’t work so far.

That massive difference seems really strange, as some segment merge mechanics are not working correctly.

There is not really a differentiation between index and data storage. CrateDB is using Lucene as storage (and indexing) layer. Reindexing a table can be done using COPY FROM/TO or INSERT INTO SELECT.

Could you manually trigger an



OPTIMIZE TABLE doc.odp_ts WITH (only_expunge_deletes = true)

after you deleted the respective rows. Be aware, that in CrateDB, a row is not deleted from a segment, just marked as deleted. During a merge process of segments, a new segment is created that does not have those deletes.

1 Like

I`ve inserted those rows into a new table, tried OPTIMIZE (on that particular partition for sure) and also with only_expunge_deletes… Crate Console writes “OPTIMIZE ok” but without any effect. The rows are still humongous and still heavy to query.

Fortunately I have to replace them by another import anyway…hope the next try via HTTP bulk_insert will produce healthy rows.

When you do a

  • is there a difference between an ‘aaa’ row and a ‘bbb’ row?
  • how many shards are created for ‘aaa’ and ‘bbb’ entries? Is data distributed evenly between them?

If you are not directly comparing COPY FROM and bulk inserts, but have transformations step in between, there should be no difference in the actual storage

Thats the _raw content of a bbb row


and here`s one aaa row:


So, no…there isn’t really any difference. The particular table I have them now has 3 Partitions (for each year 2019, 2020, 2021) each with 10 Partitions (5 Nodes with 2 CPUs each).In “Shards” view the data looks evenly distributed. Those rows also are situated in the same shards (the distinctive attribute between those rows isn’t distincting partitions)

Would it maybe be possible to get some example data for this, or a snapshot of a big and small partition. This sound really strange.

I’ve created two new tables “_bloat” and “_sleek” with the exact same schema… copied 10.000 of those datasets to each of these tables… the sleek rows were copied in 22seconds, the bloated ones in 1m25sec (!).

Also the tables are very different in size as you can see here


I’ll try to export those as csvs and deliver the schema and the data to you. Should also be possible that I snapshot those two tables to a different storageaccount and deliver you the credentials, but it will take me some time I’m afraid.

Got some news here: The significant column seems to be the location field…if I insert those rows without location (NULL), the resulting table is normal in size and behaviour.

I still don’t know what exactly is the difference as also the lighter rows have location in the very same manner. Some kind of encoding or spatial index type of problem?

I was just to write something along the line :slight_smile:

I still don’t know what exactly is the difference as also the lighter rows have location in the very same manner. Some kind of encoding or spatial index type of problem?

Yes, it really looks like this is related to the spatial index.
I see that you used a geohash index. I did some first tests with quadtrees and it seems to significantly perform better. Also I would lower the amount of shards to 1 per node

With quadtree and 1 shard, the total size is 14.6 MiB ( vs 57.7MiB + 2.6 MiB bloat/sleek with geohash)

CREATE TABLE IF NOT EXISTS "doc"."odp_timeseries_quad" (
   "entity_id" TEXT,
   "entity_type" TEXT,
   "source" TEXT,
   "value" TEXT,
   "value_num" DOUBLE PRECISION,
   "category" TEXT,
   "location_index" TEXT,
   "location" GEO_SHAPE INDEX USING quadtree,
   "location_centroid" GEO_POINT,
   "parent" TEXT,
   "content" OBJECT(IGNORED),
   "year" INTEGER GENERATED ALWAYS AS EXTRACT(YEAR FROM coalesce("time_index", current_timestamp))

Thanks for the comprehensive analysis, now I also know, why the one datasource is different to the other. The one datasource only stores Uber H3 Hexagons of a certain level which seems to intersect very well with the Geohash cells resulting in a compact index. The other datasource stores H3 Index cells of different Levels/Sizes and is more heterogenous in that sense, which seems to overlap Geohash cells in a very disadvantagous manner resulting in a bloated index which needs a lot of space and performs bad on querying.

We will end up banning this location field completely from our multi-million row timeseries table as it doesn’t make sense to store those all equal polygons anyway.

Instead we will put them into a simple, small lookup table where I also will make use of quadtree index as you showed I think :wink:

Thank you very much!