Importance of Importing data in order

djbestenergy · October 6, 2022, 9:03am

Hello all,

This is possibly a self-answering question about the order of data insertion during a migration and how it could affect performance.

We’ve got hundreds of GB of data to import. It is currently held in individual “day” data files for (n) IoT devices, with each day file holding 1440 records in timestamp order.

These need importing into CrateDB. I wanted to know whether I could run several imports at once, in which case the data will arrive out of order (because different processes insert different data files at different times), or whether a single process reading the data in order (but vastly slower) would work better.

i.e. could query performance be affected by out-of-order data?

I might be barking up the wrong tree, but I don’t know how “flexible” CrateDB is regarding this.

Many thanks in advance,
David.

hernanc · October 6, 2022, 11:01am

Hi David,
Because of the way CrateDB stores data, with partitioning, sharding, and distribution among nodes, it can ingest data that arrives “out of order” a lot faster than other systems can, so I think processing the files in parallel would be a good idea. 1440 records is not much, so you should not need additional batching. You should, however, test importing different numbers of files in parallel: at some point, too many parallel requests will overwhelm the system and the import throughput will actually go down.
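A minimal sketch of that kind of test, in Python: it imports a list of files with a bounded number of workers and reports rows per second, so the run can be repeated with different worker counts. The names `import_files` and `fake_import` and the file paths are hypothetical; `fake_import` is only a stand-in that sleeps and returns 1440, and would need replacing with whatever code actually loads one day file into CrateDB.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def import_files(
    paths: Sequence[str],
    import_one: Callable[[str], int],
    max_workers: int,
) -> Tuple[int, float]:
    """Import files with at most `max_workers` running at once.

    `import_one` loads a single file and returns the number of rows it
    inserted. Returns (total_rows, rows_per_second).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Files complete in whatever order the workers finish them,
        # so rows reach the database out of timestamp order.
        total_rows = sum(pool.map(import_one, paths))
    elapsed = time.perf_counter() - start
    rate = total_rows / elapsed if elapsed > 0 else float("inf")
    return total_rows, rate


def fake_import(path: str) -> int:
    """Hypothetical stand-in for a real loader: pretends to insert one day."""
    time.sleep(0.01)  # simulate the round trip to the database
    return 1440       # one record per minute of the day


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [f"device-{d}/day-{n}.dat" for d in range(4) for n in range(8)]
    for workers in (1, 2, 4, 8):
        rows, rate = import_files(files, fake_import, workers)
        print(f"{workers} workers: {rows} rows at {rate:,.0f} rows/s")
```

With a real loader, the worker count at which rows per second stops rising (or starts falling) is roughly the level of parallelism to use for the full import.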
Depending on the partitioning columns, sharding routing keys, primary keys defined, indexed columns, and how out of order the records arrive with respect to all of these, there may be some fragmentation, leading to additional disk space being consumed and suboptimal query performance. To maximise query performance after the data is loaded, I would suggest running the optimization process described in the “Optimization” page of the CrateDB reference documentation.
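As a rough illustration of that last step, and assuming the optimization process refers to CrateDB's `OPTIMIZE TABLE` statement, the sketch below builds a request for CrateDB's HTTP `_sql` endpoint. The table name `readings`, the URL, and the helper `optimize_request` are assumptions for the example, not something taken from this thread. The table name is interpolated directly into the statement, so it must come from trusted input.

```python
import json
import urllib.request


def optimize_request(
    table: str,
    max_num_segments: int = 1,
    url: str = "http://localhost:4200/_sql",  # assumed default local endpoint
) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that runs OPTIMIZE TABLE.

    `table` is a hypothetical, trusted table name; it is not escaped.
    """
    stmt = (
        f"OPTIMIZE TABLE {table} "
        f"WITH (max_num_segments = {int(max_num_segments)})"
    )
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = optimize_request("readings")
    print(json.loads(req.data)["stmt"])
    # To actually run it against a live cluster:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Merging down to very few segments can be I/O heavy on large tables, so it is probably best run once the bulk import has finished rather than while it is still in progress.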
