What are the compression performance and characteristics compared to TimescaleDB?

Hello, I’ve read Compare CrateDB | CrateDB vs TimescaleDB | Crate.io and Comparing databases for an Industrial IoT use-case: MongoDB, TimescaleDB, InfluxDB and CrateDB - CrateDB but I’m still a little confused about the compression performance and characteristics of CrateDB as compared to TimescaleDB.

We’re evaluating our on-premise 3-node (multi-node) TimescaleDB cluster for our time series use cases, and for our data sets (100s millions of rows, a few months of data, storing many sensor values per device per second), after enabling compression policy in TimescaleDB, we’ve seen 7x to 10x compression, in other words a remarkable disk space saving! As far as I can understand, TimescaleDB achieves this by chunking a user-defined window of data into buckets, then running appropriate compression algorithms based on the data type of the columns.

Is there some more detailed information regarding the compression performance of CrateDB? Is it possible to have CrateDB automatically compress data older than some user-defined period?

And if the answer is yes, are there things that developers should be careful about with respect to inserting time series data into older data ranges? (E.g. because the data is compressed, it might require more work to insert, update, etc.)


By default, CrateDB uses LZ4 compression for document sources. Furthermore, doc values (the columnar store) are compressed using delta encoding, bit packing, and GCD compression. Tables can also use DEFLATE instead of LZ4 to reduce the storage requirements even further. All that without any “hackery” of combining multiple rows into arrays or using specialised encoding techniques like delta-delta encoding.
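As a minimal sketch (the table and column names are made up for illustration), switching a table from the default LZ4 to DEFLATE is done via the `codec` table setting:

```sql
-- Hypothetical example table; the codec setting is the relevant part.
-- codec = 'best_compression' uses DEFLATE instead of the default LZ4.
CREATE TABLE sensor_data (
    ts TIMESTAMP WITH TIME ZONE,
    device_id TEXT,
    value DOUBLE PRECISION
) WITH (codec = 'best_compression');
```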

Some simple tests with Timescale-provided data sets showed that CrateDB typically performs better than TimescaleDB without chunk compression enabled. That said, that data was rather well suited to delta-delta encoding and XOR compression.

CREATE TABLE readings (
    "time" TIMESTAMP WITH TIME ZONE,
    device_id TEXT,
    battery_level DOUBLE PRECISION,
    battery_status TEXT,
    battery_temperature DOUBLE PRECISION,
    bssid TEXT,
    cpu_avg_1min DOUBLE PRECISION,
    cpu_avg_5min DOUBLE PRECISION,
    cpu_avg_15min DOUBLE PRECISION,
    mem_free DOUBLE PRECISION,
    mem_used DOUBLE PRECISION,
    ssid TEXT
);

An index on device and time was used for all databases.

That being said, with chunk compression enabled, Timescale is more efficient (~6 GB in this case) for use cases with very few indexes, but this quickly turns around with higher-cardinality data and the use of more indexes. Also, you can achieve many of the compression characteristics of Timescale by simply using arrays in CrateDB (remember, it is mostly a document store). So if you really want to squeeze your data a bit more, most of the Timescale “magic” can be achieved by moving data into a second table and using arrays. Also see Optimizing storage for historic time-series data
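To illustrate the array approach, here is a minimal, hypothetical sketch (the `readings_historic` table and its columns are made up): raw rows are grouped per device and hour and folded into arrays in a second table.

```sql
-- Hypothetical "compacted" table: one row per device and hour,
-- with the individual readings stored as arrays.
CREATE TABLE readings_historic (
    device_id TEXT,
    hour TIMESTAMP WITH TIME ZONE,
    times ARRAY(TIMESTAMP WITH TIME ZONE),
    battery_levels ARRAY(DOUBLE PRECISION)
);

-- Fold the raw rows into arrays, one row per device and hour.
INSERT INTO readings_historic (device_id, hour, times, battery_levels)
SELECT device_id,
       DATE_TRUNC('hour', "time"),
       ARRAY_AGG("time"),
       ARRAY_AGG(battery_level)
FROM readings
GROUP BY device_id, DATE_TRUNC('hour', "time");
```

Note that `ARRAY_AGG` does not guarantee element order, so if the pairing of timestamps and values matters for your queries, you may need an explicit ordering step when reading the data back.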

Another valid option is to use the snapshot mechanism of CrateDB and move old partitions of data, further compressed, to a low-cost blob store like S3 or Azure Storage. If you need that data again, you can restore it with a one-liner.
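As a rough sketch (the bucket, repository, snapshot names, and the `month` partition column are made up; it assumes a partitioned table), archiving an old partition to S3 and restoring it later could look like this:

```sql
-- Hypothetical S3-backed repository; snapshots written to it are compressed.
CREATE REPOSITORY archive TYPE s3
WITH (bucket = 'my-archive-bucket', compress = true);

-- Snapshot one old partition of the table.
CREATE SNAPSHOT archive.readings_2021_01
TABLE readings PARTITION (month = '2021-01')
WITH (wait_for_completion = true);

-- Once safely archived, the partition can be dropped locally and,
-- if needed again, restored with a one-liner:
RESTORE SNAPSHOT archive.readings_2021_01 ALL
WITH (wait_for_completion = true);
```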

The question might arise: if the techniques used are so simple, why don’t we just integrate them? The short answer is that we want to offer our users flexibility in how they use CrateDB. The compression techniques used by Timescale limit what you can store and how you can store it. E.g. JSON data can’t be compressed using those techniques. Furthermore, updates, deletes, and inserts into compressed chunks are limited.

Nevertheless, we are looking into storage optimisation more, and recent results are promising: saving quite a lot of storage without any real performance caveats.