Hello,
Coming from PostgreSQL background, I’m a bit confused about exporting data out of CrateDB using COPY TO
command. I’m following the instructions at https://crate.io/docs/crate/reference/en/4.6/sql/statements/copy-to.html.
I’m running a 3-node CrateDB cluster on a Debian 10 Linux servers. The CrateDB version is 4.6.3.
I have a simple table for this experiment:
cr> SHOW CREATE TABLE emre.test_table_002 ;
+------------------------------------------------------+
| SHOW CREATE TABLE emre.test_table_002 |
+------------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "emre"."test_table_002" ( |
| "time" TIMESTAMP WITH TIME ZONE, |
| "location_id" TEXT, |
| "device_id" TEXT, |
| "temperature" DOUBLE PRECISION, |
| "humidity" DOUBLE PRECISION, |
| PRIMARY KEY ("time", "location_id", "device_id") |
| ) |
| CLUSTERED INTO 6 SHARDS |
| WITH ( |
| "allocation.max_retries" = 5, |
| "blocks.metadata" = false, |
| "blocks.read" = false, |
| "blocks.read_only" = false, |
| "blocks.read_only_allow_delete" = false, |
| "blocks.write" = false, |
| codec = 'default', |
| column_policy = 'strict', |
| "mapping.total_fields.limit" = 1000, |
| max_ngram_diff = 1, |
| max_shingle_diff = 3, |
| number_of_replicas = '0-1', |
| "routing.allocation.enable" = 'all', |
| "routing.allocation.total_shards_per_node" = -1, |
| "store.type" = 'fs', |
| "translog.durability" = 'REQUEST', |
| "translog.flush_threshold_size" = 536870912, |
| "translog.sync_interval" = 5000, |
| "unassigned.node_left.delayed_timeout" = 60000, |
| "write.wait_for_active_shards" = '1' |
| ) |
+------------------------------------------------------+
SHOW 1 row in set (0.013 sec)
The table contains the following data:
cr> SELECT * FROM emre.test_table_002 ;
+---------------+-------------+------------+-------------+----------+
| time | location_id | device_id | temperature | humidity |
+---------------+-------------+------------+-------------+----------+
| 1633050000000 | Dendermonde | device_001 | 12.1 | 40.3 |
| 1633050000000 | Dendermonde | device_002 | 13.1 | 41.3 |
| 1633050000000 | Wilrijk | device_002 | 13.1 | 41.3 |
+---------------+-------------+------------+-------------+----------+
SELECT 3 rows in set (0.037 sec)
On another Linux machine (non-CrateDB) using crash
utility I’m connected to one of the CrateDB nodes, and I run the COPY TO command:
cr> COPY emre.test_table_002 TO DIRECTORY '/tmp/' ;
COPY OK, 3 rows affected (0.050 sec)
Then I go and check the /tmp
directories of each CrateDB node.
crate-dn-001:~$ ls -lah /tmp/*.json
-rw-r--r-- 1 crate crate 111 Nov 12 12:41 /tmp/test_table_002_1_.json
-rw-r--r-- 1 crate crate 0 Nov 12 12:41 /tmp/test_table_002_2_.json
-rw-r--r-- 1 crate crate 0 Nov 12 12:41 /tmp/test_table_002_4_.json
-rw-r--r-- 1 crate crate 0 Nov 12 12:41 /tmp/test_table_002_5_.json
crate-dn-002:~$ ls -lah /tmp/*.json
ls: cannot access '/tmp/*.json': No such file or directory
crate-dn-003:~$ ls -lah /tmp/*.json
-rw-r--r-- 1 crate crate 107 Nov 12 12:41 /tmp/test_table_002_0_.json
-rw-r--r-- 1 crate crate 111 Nov 12 12:41 /tmp/test_table_002_3_.json
My confusion is related to the following questions:
- Six JSON files scattered to two nodes: is this because of shards?
- Why some empty JSON files (0 bytes)? I mean… you can say “empty shards”, but I still find empty JSON files confusing.
And a few questions more in the category of wishlist / roadmap:
- Do you plan to add support for exporting to non-JSON formats, e.g. CSV?
- Do you plan to add support for a single export file per node? I mean, is there a reason for not exporting to a single JSON file e.g. on
crate-dn-001
? What’s the motivation behind mapping number of shards to number of exported files on a given node? - Do you plan to add support for custom file names for exported files, at least having the ability to add some prefix / suffix to the name generated by CrateDB?