tl;dr: inserts are slow. How do I speed them up?
I have roughly 33 billion records to import into a CrateDB cluster. My ETL/MapReduce process cuts the raw data down from 18 TB to roughly 4 TB, and the parsed output is stored as Parquet files (roughly 6,500 files totaling 10 GB). Since CrateDB doesn't support Parquet import, I wrote some Julia code to pull the data out of the Parquet files and push it into CrateDB over the PostgreSQL wire protocol. The cluster currently has two mid-sized servers (64 CPUs, 512 GB RAM, 40 TB NVMe each), with a third on the way.
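For context, the per-file loading path looks roughly like the sketch below (not my exact code: the connection string, table name, and columns are placeholders). It reads a Parquet file with Parquet.jl and pushes rows into CrateDB through LibPQ.jl, where `LibPQ.load!` prepares the INSERT once and executes it once per row.

```julia
using LibPQ, Parquet

# One connection shown here; the real loader opens one per worker task.
conn = LibPQ.Connection("host=crate-node1 port=5432 user=crate dbname=doc")

# read_parquet returns a Tables.jl-compatible table.
tbl = read_parquet("part-00001.parquet")

# Prepares the statement once, then executes it once per row of tbl.
LibPQ.load!(
    tbl,
    conn,
    "INSERT INTO my_table (sensor_id, ts, value) VALUES (\$1, \$2, \$3)",
)
close(conn)
```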
Even with threaded async connections, I can't seem to get CrateDB to insert more than about 287 rows per second. It doesn't seem to matter whether I load an entire Parquet file into memory and insert it as one batched transaction or insert the rows one by one; see the batched sketch below.
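This is roughly the shape of the batched variant I've tested (again a simplified sketch, not my real code: `insert_batched`, `my_table`, and the column handling are placeholders). It chunks the rows and issues one multi-row parameterized INSERT per chunk over the same PostgreSQL wire connection.

```julia
using LibPQ, Parquet, Tables

# Insert `rows` (a vector of NamedTuples) in chunks, one multi-row
# parameterized INSERT per chunk. Keep batch_size * ncols under the wire
# protocol's 65535-parameter limit.
function insert_batched(conn::LibPQ.Connection, rows; batch_size::Int = 1_000)
    isempty(rows) && return
    cols    = keys(first(rows))          # column names as Symbols
    ncols   = length(cols)
    collist = join(String.(cols), ", ")

    for chunk in Iterators.partition(rows, batch_size)
        n = length(chunk)
        # Placeholder groups: ($1,$2,...),($ncols+1,...),...
        groups = [string("(", join(("\$$((i - 1) * ncols + j)" for j in 1:ncols), ","), ")")
                  for i in 1:n]
        sql    = "INSERT INTO my_table ($collist) VALUES " * join(groups, ",")
        params = [row[c] for row in chunk for c in cols]   # flattened, row-major
        execute(conn, sql, params)
    end
end

conn = LibPQ.Connection("host=crate-node1 port=5432 user=crate dbname=doc")
rows = Tables.rowtable(read_parquet("part-00001.parquet"))  # whole file in memory
insert_batched(conn, rows)
close(conn)
```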
At this rate, billions of rows are going to take weeks. Is there a faster way?
Any chance there will be support for Parquet as a source format? Anything non-text-based would be great for mid-sized datasets.