This is possibly a self-answering question about the order of data insertion during a migration and how it could affect performance.
We’ve got hundreds of GB of data to import. It is currently held in individual “day” data files for (n) IoT devices, with each day file holding 1440 records in timestamp order.
These need importing into CrateDB, and I wanted to know whether I could run several imports at once, in which case the data will arrive out of order (because different processes insert different data files at different times), or whether a single process reading the data in order (but vastly slower) would work better.
In other words, could query performance be affected by out-of-order data?
I might be barking up the wrong tree, but I don’t know how “flexible” CrateDB is regarding this.
Many thanks in advance,
The way CrateDB stores data, with partitioning, sharding, and distribution among nodes, means it can be a lot faster than other systems at ingesting data that arrives “out of order”, so I think processing the files in parallel would be a good idea. 1440 records per file is not much, so you should not need additional batching. You should, however, test importing different numbers of files in parallel, because if you run too many requests at once there will be a point where the system gets overwhelmed and the import throughput actually goes down.
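To make the "test different levels of parallelism" advice concrete, here is a minimal Python sketch of how such a test could be structured. It is not CrateDB-specific code: `read_rows` and `bulk_insert` are hypothetical callables you would supply yourself (one parses a day file, the other sends its rows to the database and returns the number inserted), and the candidate worker counts are arbitrary starting points.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple  # one record; the exact shape (device, timestamp, value, ...) is an assumption


def import_files(
    paths: Sequence[str],
    read_rows: Callable[[str], List[Row]],    # hypothetical: parses one day file
    bulk_insert: Callable[[List[Row]], int],  # hypothetical: inserts rows, returns count
    max_workers: int,
) -> Tuple[int, float]:
    """Import every file with at most `max_workers` imports running at once.

    Returns (total_rows_inserted, elapsed_seconds).
    """

    def import_one(path: str) -> int:
        # One day file is small (1440 rows in the question), so it is sent as one batch.
        return bulk_insert(read_rows(path))

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        total = sum(pool.map(import_one, paths))
    return total, time.perf_counter() - start


def throughput_by_workers(
    sample_paths: Sequence[str],
    read_rows: Callable[[str], List[Row]],
    bulk_insert: Callable[[List[Row]], int],
    candidates: Sequence[int] = (1, 2, 4, 8, 16),
) -> Dict[int, float]:
    """Run the same sample of files at each parallelism level and report rows/second."""
    results: Dict[int, float] = {}
    for workers in candidates:
        total, elapsed = import_files(sample_paths, read_rows, bulk_insert, workers)
        results[workers] = total / max(elapsed, 1e-9)  # guard against a zero interval
    return results
```

Run `throughput_by_workers` on a sample of files (loading into a scratch table, since each level re-imports the same sample) and pick the worker count just before rows/second stops improving. Threads are used here on the assumption that the work is dominated by waiting on the database; if parsing the files turns out to be the bottleneck, separate processes may be a better fit.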
Depending on the partitioning columns, shard routing keys, primary keys, and indexed columns you define, and on how out of order the records arrive with respect to all of these, there may be some fragmentation, which consumes additional disk space and makes query performance suboptimal. To maximise query performance after the data is loaded, I would suggest running the optimization process described in the Optimization section of the CrateDB reference.
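As a rough illustration of that post-load step, the sketch below builds and runs an `OPTIMIZE TABLE` statement from Python. The table name `doc.readings` and the choice of `max_num_segments = 1` are assumptions for the example, not recommendations for your schema, and `cursor` is assumed to be any DB-API style cursor connected to CrateDB.

```python
import re
from typing import Any

# Accept only plain `table` or `schema.table` names so the value is safe to
# interpolate into the statement (identifiers cannot be bound as parameters).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def optimize_statement(table: str, max_num_segments: int = 1) -> str:
    """Build an OPTIMIZE TABLE statement that merges segments down to the given count."""
    if not _IDENT.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    return f"OPTIMIZE TABLE {table} WITH (max_num_segments = {max_num_segments})"


def optimize_after_import(cursor: Any, table: str, max_num_segments: int = 1) -> str:
    """Run the statement once the bulk load has finished and return it for logging."""
    statement = optimize_statement(table, max_num_segments)
    cursor.execute(statement)
    return statement
```

For example, `optimize_after_import(cursor, "doc.readings")` would issue `OPTIMIZE TABLE doc.readings WITH (max_num_segments = 1)`. Merging segments is I/O heavy, so it is probably best run once at the end of the migration rather than between files.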