»Loading data into CrateDB« weekly edition

Introduction

We are currently unlocking data loading into CrateDB using the excellent ingestr toolkit, which is based on dlt [1]. This topic reports on our progress and gives everyone the chance to participate early in the development.

Prerequisites

To execute the commands in this walkthrough, you need a working installation of Docker or Podman and a Python installation on your machine. For installing Python packages, we recommend using the uv package manager [2].

Call for support

Because the relevant data adapters are still in their infancy, we very much appreciate feedback in the form of bug reports, suggestions for improvement, or success stories.


  1. The CrateDB destination adapter for ingestr uses dlt via dlt-cratedb. ↩︎

  2. The uv package manager can easily be installed using pip or pipx, e.g. pipx install uv. It also offers other installation methods. ↩︎

Loading data from Amazon Kinesis

Prerequisites

The tutorial uses LocalStack to spin up a local instance of Amazon Kinesis, so you don’t need an AWS account to follow along.

Install

Install the most recent Python packages awscli, crash, and ingestr.

uv tool install --upgrade awscli crash ingestr

Tutorial

Services

Run Kinesis from LocalStack and CrateDB using Docker or Podman.

docker run --rm --name=localstack \
  --publish=4566:4566 \
  docker.io/localstack/localstack:latest
docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  docker.io/crate:latest '-Cdiscovery.type=single-node'

Configure AWS clients

LocalStack’s default region is us-east-1. Let’s use it.

export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL="http://localhost:4566"

Load data

Create the Kinesis stream.

aws kinesis create-stream --stream-name=demo

Publish two data payloads to the Kinesis stream.

aws kinesis put-record \
  --stream-name=demo --partition-key default \
  --data '{"sensor_id":1,"ts":"2025-06-01 10:00","reading":42.42}'

aws kinesis put-record \
  --stream-name=demo --partition-key default \
  --data '{"sensor_id":2,"ts":"2025-06-01 11:00","reading":451.00}'
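If you want to publish more than a handful of records, the payloads above can also be generated programmatically. This is a minimal sketch: the field names (sensor_id, ts, reading) follow the example records, while the helper name make_payload is ours and not part of any toolkit.

```python
# Generate sensor payloads in the same JSON shape as the examples above.
import json
from datetime import datetime

def make_payload(sensor_id: int, ts: datetime, reading: float) -> str:
    """Serialize one sensor reading to the JSON shape used above."""
    return json.dumps({
        "sensor_id": sensor_id,
        "ts": ts.strftime("%Y-%m-%d %H:%M"),
        "reading": reading,
    })

payload = make_payload(1, datetime(2025, 6, 1, 10, 0), 42.42)
print(payload)  # {"sensor_id": 1, "ts": "2025-06-01 10:00", "reading": 42.42}
```

Pass the resulting string to `aws kinesis put-record --data` as shown above.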

Transfer data

Use ingestr to load data from the Kinesis stream into a CrateDB table.

ingestr ingest --yes \
  --source-uri "kinesis://?aws_access_key_id=test&aws_secret_access_key=test&region_name=us-east-1" \
  --source-table "demo" \
  --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
  --dest-table "kinesis.sensor_demo"
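When credentials contain characters that need URL escaping, assembling the source URI by hand gets error-prone. A small sketch, assuming only the Python standard library; the parameter names mirror the command above.

```python
# Assemble the ingestr source URI from its parts instead of hand-editing
# the query string. urlencode() takes care of escaping values.
from urllib.parse import urlencode

params = {
    "aws_access_key_id": "test",
    "aws_secret_access_key": "test",
    "region_name": "us-east-1",
}
source_uri = "kinesis://?" + urlencode(params)
print(source_uri)
```

The printed value is exactly the `--source-uri` argument used above.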

Query data

Submit queries to CrateDB using crash.

crash -c "SELECT count(*) FROM kinesis.sensor_demo"
crash -c "SELECT * FROM kinesis.sensor_demo"
crash -c "SHOW CREATE TABLE kinesis.sensor_demo"

Documentation

Other options

You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.

Loading data from Apache Kafka

Install

Install the most recent Python packages crash and ingestr.

uv tool install --upgrade crash ingestr

Tutorial

Services

Run Apache Kafka and CrateDB using Docker or Podman.

docker run --rm --name=kafka \
  --publish=9092:9092 docker.io/apache/kafka:4.0.0
docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  docker.io/crate:latest '-Cdiscovery.type=single-node'

Load data

Publish two events to a Kafka topic using kcat [1].

echo '{"sensor_id":1,"ts":"2025-06-01 10:00","reading":42.42}' | \
  kcat -P -b localhost -t demo

echo '{"sensor_id":2,"ts":"2025-06-01 11:00","reading":451.00}' | \
  kcat -P -b localhost -t demo

Verify events are present by subscribing to the Kafka topic.

kcat -C -e -b localhost -t demo

Transfer data

Use ingestr to load data from the Kafka topic into a CrateDB table.

ingestr ingest --yes \
  --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test" \
  --source-table "demo" \
  --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
  --dest-table "kafka.sensor_demo"

Query data

Submit queries to CrateDB using crash.

crash -c "SELECT count(*) FROM kafka.sensor_demo"
crash -c "SELECT * FROM kafka.sensor_demo"
crash -c "SHOW CREATE TABLE kafka.sensor_demo"

Documentation

See also

Use kafka-compose.yml and kafka-demo.xsh for an end-to-end Kafka+CrateDB-in-a-box example rig using {Docker,Podman} Compose.

Other options

You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.


  1. You can install kcat, the command-line Apache Kafka producer and consumer tool, using {apt,brew} install kcat. ↩︎

Loading data from Databricks SQL warehouses

Prerequisites

You will need an account on https://databricks.com/.

Install

Install the most recent Python packages crash and ingestr.

uv tool install --upgrade crash ingestr

Tutorial

Databricks authentication information

For accessing Databricks SQL warehouses using ingestr, you need an access token, your server hostname, and the HTTP path (the endpoint of the warehouse).

On the Databricks platform, open your settings dialog (top right corner), navigate to “User » Developer » Access tokens » Manage”, and select “Generate new token”. Then copy the generated token from the result dialog (example: dapi367e0b1....) and use it in your ingestr command below.

https://<instance>.cloud.databricks.com/settings/user/developer

Within the primary navigation, navigate to “SQL » SQL Warehouses”, select the “Serverless Starter Warehouse” or any other warehouse, open “Connection Details”, and use the values of “Server hostname” and “HTTP path” in your ingestr command below.

https://<instance>.cloud.databricks.com/sql/warehouses/<warehouse>/connectionDetails
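The three values above combine into the ingestr source URI. A minimal sketch with the standard library; the token, hostname, and warehouse ID below are hypothetical placeholders, not real credentials.

```python
# Build the databricks:// source URI from the token, server hostname,
# and HTTP path collected above. All example values are hypothetical.
from urllib.parse import urlencode

access_token = "dapi-example"                              # hypothetical
server_hostname = "dbc-a1b2c3d4-e5f6.cloud.databricks.com"  # hypothetical
http_path = "/sql/1.0/warehouses/abc123def456"              # hypothetical

source_uri = (
    f"databricks://token:{access_token}@{server_hostname}:443/?"
    + urlencode({"http_path": http_path, "catalog": "samples"}, safe="/")
)
print(source_uri)
```

Note `safe="/"`, which keeps the slashes in the HTTP path readable instead of percent-encoding them.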

Services

Run CrateDB using Docker or Podman.

docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  docker.io/crate:latest '-Cdiscovery.type=single-node'

Load data

The Databricks starter warehouse includes a few example tables. Let’s tap into and transfer two of them, then inspect their schema and data in CrateDB.

Transfer data

samples.nyctaxi.trips

Transfer data:

ingestr ingest --yes \
    --source-uri 'databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples' \
    --source-table 'nyctaxi.trips' \
    --dest-uri 'cratedb://crate:crate@localhost:5432?sslmode=disable' \
    --dest-table 'databricks.nyctaxi_trips'

Query data:

crash -c 'SHOW CREATE TABLE databricks.nyctaxi_trips'
crash -c 'SELECT * FROM databricks.nyctaxi_trips LIMIT 5'

samples.accuweather.forecast_hourly_metric

Transfer data:

ingestr ingest --yes \
    --source-uri 'databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples' \
    --source-table 'accuweather.forecast_hourly_metric' \
    --dest-uri 'cratedb://crate:crate@localhost:5432?sslmode=disable' \
    --dest-table 'databricks.accuweather_forecast_hourly_metric'

Query data:

crash -c 'SHOW CREATE TABLE databricks.accuweather_forecast_hourly_metric'
crash -c 'SELECT * FROM databricks.accuweather_forecast_hourly_metric LIMIT 5'

Documentation

Other options

You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.

Loading data from SAP HANA

Prerequisites

The tutorial uses SAP HANA express to spin up a local instance of HANA for evaluation purposes.

Install

Install the most recent Python packages crash and ingestr.

uv tool install --upgrade crash ingestr

Tutorial

Services

Run SAP HANA express and CrateDB using Docker or Podman.

docker run --rm --name=hana \
  -p 39013:39013 -p 39017:39017 -p 39041-39045:39041-39045 \
  -p 1128-1129:1128-1129 -p 59013-59014:59013-59014 \
  docker.io/saplabs/hanaexpress:latest \
  --master-password HXEHana1 \
  --agree-to-sap-license
docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  docker.io/crate:latest '-Cdiscovery.type=single-node'

Note: Starting HANA takes a while. It will only respond on port 39017 (system database) once the log output says "HANA is up", and on port 39041 (tenant database) once it says "Startup finished!".
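If you script this tutorial, you can poll the relevant port instead of watching the log output. A generic helper sketch, not part of ingestr or the HANA image:

```python
# Poll a TCP port until it accepts connections, e.g. to wait for HANA's
# system database (39017) or tenant database (39041) to come up.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 600.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(2.0)
    return False

# e.g. wait_for_port("localhost", 39017) before running the ingestr command
```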

Pre-flight checks

Run a basic connectivity check against the system database.

docker exec -it hana bash -ic "hdbsql -i 90 -n localhost:39017 -u SYSTEM -p HXEHana1 'SELECT * FROM sys.dummy'"

Load data

We will select from the built-in table sys.adapters, so the tutorial can skip the extra step of importing data into HANA first.

Transfer data

Let’s connect to the system database because it becomes available earlier than the tenant database. Alternatively, address the tenant database at localhost:39041/HXE.

ingestr ingest --yes \
  --source-uri "hana://SYSTEM:HXEHana1@localhost:39017/SYSTEMDB" \
  --source-table "sys.adapters" \
  --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
  --dest-table "hana.sys_adapters"

Query data

crash -c "SHOW CREATE TABLE hana.sys_adapters"
crash -c "SELECT * FROM hana.sys_adapters LIMIT 2"

Documentation

Other options

You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.

Apache Iceberg and Delta Lake (load and save)

Hi again. We recently added I/O adapters for Apache Iceberg tables [1] and Delta Lake tables [2], in line with our aim to enhance interoperability with open table formats.

Both are open table formats that build upon Apache Parquet data files, a free and open-source column-oriented data storage format, effectively superseding Apache Hive use cases from the Hadoop era.

CrateDB Toolkit now provides adapters to import and export data into and from those open table formats. Please let us know if you discover any flaws, and don’t hesitate to share ideas for improvement. Thank you in advance. 🙏

Synopsis

uv tool install --upgrade 'cratedb-toolkit[iceberg,deltalake]'
ctk load table \
    "s3+iceberg://bucket1/demo/taxi-tiny/metadata/00003-dd9223cb-6d11-474b-8d09-3182d45862f4.metadata.json?s3.access-key-id=<your_access_key_id>&s3.secret-access-key=<your_secret_access_key>&s3.endpoint=<endpoint_url>&s3.region=<s3-region>" \
    --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny"
ctk load table \
    "s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=<your_secret_access_key>&AWS_ENDPOINT=<endpoint_url>&AWS_REGION=<s3-region>" \
    --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny"

Documentation


  1. Iceberg is a specification and high-performance format for huge analytic tables, making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time. Apache Iceberg is its reference implementation. ↩︎

  2. Delta Lake (paper) is the optimized storage layer that provides the foundation and default format for all table operations on Databricks. It was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale. ↩︎

Loading data from Elasticsearch

Install

Install the Python packages crash, httpie, and ingestr.

uv tool install crash httpie ingestr

Tutorial

Services

Run Elasticsearch and CrateDB using Docker or Podman.

docker run --rm --name=elasticsearch \
  --publish=9200:9200 --env=discovery.type=single-node \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.29
docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
  docker.io/crate:latest '-Cdiscovery.type=single-node'

Load data

Create the Elasticsearch index.

http PUT http://localhost:9200/example

Acquire example data.

wget https://cdn.crate.io/downloads/datasets/cratedb-datasets/academy/chicago-data/taxi_details.csv

Import data into Elasticsearch.

ingestr ingest --yes \
  --source-uri "csv://taxi_details.csv" \
  --source-table "data" \
  --dest-uri "elasticsearch://localhost:9200?secure=false" \
  --dest-table "taxi_details"
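If you prefer to experiment offline first, a tiny stand-in CSV works with the same csv:// source. A sketch using only the standard library; the column names here are illustrative only and are not the real schema of taxi_details.csv.

```python
# Write a small stand-in CSV for offline experiments with the csv:// source.
# Columns are hypothetical, not the real taxi_details.csv schema.
import csv
import tempfile
from pathlib import Path

path = Path(tempfile.gettempdir()) / "taxi_details_standin.csv"
rows = [
    {"trip_id": "t-1", "fare": "12.50"},
    {"trip_id": "t-2", "fare": "8.75"},
]
with path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["trip_id", "fare"])
    writer.writeheader()
    writer.writerows(rows)
print(f"wrote {path}")
```

Point `--source-uri` at the generated file (csv://&lt;path&gt;) to rehearse the pipeline before using the real dataset.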

Transfer data

Use ingestr to load data from the Elasticsearch index into a CrateDB table.

ingestr ingest --yes \
  --source-uri "elasticsearch://localhost:9200?secure=false" \
  --source-table "taxi_details" \
  --dest-uri "cratedb://crate:crate@localhost:5432" \
  --dest-table "elasticsearch.taxi_details"

Query data

Submit queries to CrateDB using crash.

crash -c "SHOW CREATE TABLE elasticsearch.taxi_details"
crash -c "SELECT count(*) FROM elasticsearch.taxi_details"
crash -c "SELECT * FROM elasticsearch.taxi_details"

Documentation

See also

Use elasticsearch-compose.yml and elasticsearch-demo.sh for an end-to-end Elasticsearch+CrateDB-in-a-box example ETL rig using {Docker,Podman} Compose.

Other options

You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.