This tutorial is also available as a video: CrateDB Video | Fundamentals: Importing and Exporting Data in CrateDB.
This tutorial presents the basics of COPY FROM and COPY TO in CrateDB. For in-depth details of CrateDB COPY FROM, refer to our Fundamentals of the COPY FROM statement post.
Before anything else, I demonstrate how to link my local file system to the Docker container running CrateDB. This step is only necessary if you are running CrateDB with Docker. Then, I will show you how to import JSON and CSV data into CrateDB using the COPY FROM statement. Finally, I will show you how to export data from CrateDB to a local file system using the COPY TO statement.
Importing Data - JSON
Importing data into CrateDB is done with the COPY FROM statement.
Reading the COPY FROM documentation, I see that COPY FROM accepts both JSON and CSV inputs, and in this tutorial, I will show you how to do it with both, starting with JSON. You can refer to the documentation on the crate.io website for more details on COPY FROM.
But before getting to the dataset, let me quickly explain how to link local files to CrateDB, in case you are running CrateDB with Docker, as I am.
Linking local files to Docker container
To put it simply: the Docker container is the running environment for CrateDB. This means that CrateDB does not have direct access to my local filesystem, only to the filesystem inside the container.
So when I want to import local files into CrateDB, I make them available to CrateDB's Docker container with the --volume flag, which has the following structure:
--volume=/path/on/your/machine/to/file.json:/path/in/docker
I create a folder on my machine called my_datasets, where I will store the JSON file, and link this folder to a docker_datasets folder in the Docker container.
To apply these changes to my CrateDB setup, I stop Docker with CrateDB and start it again with the following command:
docker run --publish=4200:4200 --publish=5433:5432 --volume=/Users/rafaelasantana/my_datasets:/docker_datasets --env CRATE_HEAP_SIZE=1g crate:latest
From now on, I can add my datasets to my local my_datasets folder, and they will be accessible from CrateDB!
Dataset Overview
The first dataset I'm using today is a small collection of five quotes formatted in JSON, each having the Quote, Author, Tags, Popularity, and Category keys.
If you want to import your own dataset, ensure it contains a single JSON object per line, with no commas separating the objects. You can find more information on the formatting in our COPY FROM documentation.
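Before running an import, it can save a round trip to verify that every line of the file really is a standalone JSON object. Here is a small sketch in Python that does that check; the demo file name and its contents are illustrative, mirroring the quotes dataset:

```python
import json

def validate_json_lines(path):
    """Return (line_number, error) pairs for lines that are not standalone JSON objects."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are harmless
            try:
                if not isinstance(json.loads(line), dict):
                    bad.append((lineno, "not a JSON object"))
            except json.JSONDecodeError as exc:
                bad.append((lineno, str(exc)))
    return bad

# Quick demo with an illustrative file (keys mirror the quotes dataset):
with open("check_me.json", "w", encoding="utf-8") as f:
    f.write('{"Quote": "a", "Author": "b", "Tags": ["t"], "Popularity": 0.1, "Category": "c"}\n')
    f.write('{"Quote": "a"}, {"Quote": "b"}\n')  # comma-separated objects: invalid

print(validate_json_lines("check_me.json"))  # line 2 is flagged
```

If the returned list is empty, the file is in the one-object-per-line shape that COPY FROM expects.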
Creating a table to store data
I set up a table in CrateDB to store the single_lined.json data. I take the object's keys as the table columns, so it looks like this:
CREATE TABLE dataset_quotes (
"Quote" TEXT,
"Author" TEXT,
"Tags" ARRAY(TEXT),
"Popularity" DOUBLE,
"Category" TEXT
);
And then, I run the following COPY FROM statement, which imports single_lined.json into the dataset_quotes table. Note that the path points to the docker_datasets folder inside the Docker container, not to my local file path.
Also, adding RETURN SUMMARY to the end of my queries gives me detailed error reporting in case something does not work as expected.
COPY dataset_quotes
FROM '/docker_datasets/single_lined.json'
RETURN SUMMARY;
Now that I have successfully imported JSON data into CrateDB, let's quickly check out how it works with CSV.
Importing Data - CSV
For the CSV data, I use a dataset from Kaggle with around 500k quote records. It consists of three columns: the quote, the author of the quote, and the category tags for that quote.
I download the dataset, name it quote_dataset.csv, and save it in that same my_datasets folder, which is mounted in Docker and accessible from CrateDB.
Then, I open the dataset to check the column names and data types.
I see the quote, author, and category columns, all containing TEXT data.
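That header check can also be scripted instead of opening the file by hand. The sketch below uses Python's csv module to read just the header row and a sample row; the file it writes first is an illustrative stand-in for quote_dataset.csv:

```python
import csv

def peek_csv(path, sample_rows=1):
    """Return the header row and the first few data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(sample_rows), reader)]
    return header, rows

# Illustrative stand-in for the Kaggle file:
with open("quote_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author", "category"])
    writer.writerow(["To be, or not to be", "William Shakespeare", "philosophy"])

header, rows = peek_csv("quote_dataset.csv")
print(header)  # ['quote', 'author', 'category']
print(rows)
```

The header names found this way become the column names of the target table, as in the CREATE TABLE statement that follows.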
So I copy these headers and create a table in CrateDB with the same columns.
CREATE TABLE csv_quotes (
quote TEXT,
author TEXT,
category TEXT
);
Now, all that is left is to run the COPY FROM statement to import the data into CrateDB.
COPY csv_quotes
FROM '/docker_datasets/quote_dataset.csv'
RETURN SUMMARY;
Here, CrateDB reports five errors in this immense import, probably due to formatting issues within the dataset.
Most importantly, I see that CrateDB ingested close to 500k rows in seconds!
I query the csv_quotes table in the Table Browser to get a glimpse of the imported data.
Exporting Data
Finally, I can easily export data from CrateDB using the COPY TO statement.
In the COPY TO documentation, I read:
"The COPY TO command exports the contents of a table to one or more files into a given directory with unique filenames. Each node with at least one table shard will export its contents onto its local disk.
The created files are JSON formatted and contain one table row per line and, due to the distributed nature of CrateDB, will remain on the same nodes where the shards are."
I head to the Shards Tab in Admin UI and see that CrateDB distributed my tables into shards 0, 1, 2, and 3.
So, I expect CrateDB to export each of these shards’ data into an individual file.
I copy the csv_quotes table content to the docker_datasets folder. Since the Docker folder is linked to my local my_datasets folder, CrateDB will successfully export the values to my_datasets.
COPY csv_quotes TO DIRECTORY '/docker_datasets/';
Now I see four files (one for each CrateDB shard) in my local my_datasets folder, containing the exported data in JSON format.
For example, I open csv_quotes_1_.json and see that each row was formatted as a JSON object with quote, author, and category keys.
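Because COPY TO writes one file per shard, a downstream consumer often wants to read all of them back together. A small sketch that counts rows across every matching export file; the shard files it creates here are illustrative stand-ins for CrateDB's per-shard output:

```python
import glob
import json

def count_exported_rows(pattern):
    """Count JSON objects across all export files matching a glob pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises if a line is malformed
                    total += 1
    return total

# Illustrative shard files, mimicking CrateDB's one-file-per-shard export:
for shard in range(2):
    with open(f"csv_quotes_{shard}_.json", "w", encoding="utf-8") as f:
        f.write('{"quote": "q", "author": "a", "category": "c"}\n')

print(count_exported_rows("csv_quotes_*_.json"))  # 2
```

On a real export, the total should match the row count of the source table, provided all shard files from all nodes are present in the directory.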
With that, we come to the end of this tutorial on importing and exporting data in CrateDB.
Reference
- COPY FROM (documentation): COPY FROM — CrateDB: Reference
- COPY TO (documentation): COPY TO — CrateDB: Reference
- Quotes Dataset: Quotes-500k | Kaggle