I’m encountering weird behaviour with COPY FROM when attempting to import a fairly small CSV file into a 3-node cluster (version 4.1.3). The cluster is running in AWS EKS on 3 m5.2xlarge instances. Each pod has 7 CPUs and 26 GB of memory, split 50/50 between CrateDB and Lucene.
The file has around 55 million rows and its gzipped size is around 200 MB.
I have created the table like this:
CREATE TABLE result (
    device integer,
    time timestamp with time zone,
    idx smallint,
    count smallint
)
CLUSTERED BY (device) INTO 12 SHARDS
WITH (
    number_of_replicas = 0,
    refresh_interval = 0
)
and I'm running the following command to import the file from S3:
copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1')
What then happens is that CrateDB starts a new job every minute or so. If I query the running jobs:
select * from sys.jobs order by started asc limit 100;
after a few minutes it looks like this:
id | node | started | stmt | username |
---|---|---|---|---|
be842aaf-dd0c-4af2-8cb5-107ec7e3dfbf | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101060977 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
2907bcb7-5c6d-45d3-8630-0fe3b7c23112 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101120535 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
6954cd0d-99c0-46ec-a336-86bff85074a1 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101181028 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
6090be76-7293-4f8e-8947-96bcef4e114f | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101240535 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
a49f089e-603a-4983-9b4d-e71695f542d3 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101300535 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
98e22fc4-c096-4837-b1a4-01e8c4930667 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101420535 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
7c863dfe-95f1-4543-92e9-3c6615ffe172 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101480535 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
f5f3e260-b721-4837-a46e-e2411be030d5 | {"name":"Busazza","id":"l2OWa8OJRxWsiVarjF5Hpw"} | 1584101540536 | copy single_result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') | crate |
and the import never completes, but the cluster keeps using more and more CPU.
Also, there don't seem to be any errors in the logs of the cluster nodes.
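Is sys.jobs_log the right place to check whether the earlier COPY attempts are failing silently? Assuming its error column gets populated for failed statements, a query along these lines should show it:

select started, ended, error, stmt
from sys.jobs_log
where stmt like 'copy%'
order by ended desc
limit 20;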
Any ideas how to get this working?
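P.S. Would adding RETURN SUMMARY to the COPY statement give more detail here? I believe COPY FROM ... RETURN SUMMARY is supported on 4.1 and returns per-node success/error counts, i.e. something like:

copy result from 's3://secret:secret@bucket/file_9_with_header.gz' WITH (format='csv', compression='gzip', shared='true', num_readers='1') RETURN SUMMARY;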