»Loading data into CrateDB« weekly edition

Introduction

We are currently unlocking data loading into CrateDB using the excellent ingestr toolkit, which is based on dlt [1]. This topic reports on the progress and gives everyone the chance to participate early in the development.

Prerequisites

To execute the commands in this walkthrough, you need a working installation of Docker or Podman and a Python installation on your machine. For installing Python packages, we recommend using the uv package manager [2].
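A quick way to confirm these prerequisites is a small shell check. This is a minimal sketch: the helper name `have` is made up for illustration, and it only tests whether the programs are on the PATH, not whether the Docker or Podman service is actually running.

```shell
# Hypothetical helper: succeeds if the given program is found on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# A container engine is needed: either Docker or Podman.
if have docker || have podman; then
    echo "container engine: ok"
else
    echo "container engine: missing (install Docker or Podman)"
fi

# A Python interpreter is needed as well.
if have python3 || have python; then
    echo "python: ok"
else
    echo "python: missing"
fi

# uv is the recommended package manager, not a hard requirement.
if have uv; then
    echo "uv: ok"
else
    echo "uv: missing (for example: pipx install uv)"
fi
```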

Call for support

Because the relevant data adapters are still in their infancy, we would very much appreciate feedback in the form of bug reports, suggestions for improvement, or success stories.

Other options

CrateDB also provides integrations for many other ETL applications and frameworks.


  1. The CrateDB destination adapter for ingestr uses dlt via dlt-cratedb.

  2. The uv package manager can easily be installed using pip or pipx, e.g. pipx install uv. It also offers other installation methods.

Loading data from Amazon Kinesis

Synopsis

ctk load \
    "kinesis:?aws_access_key_id=test&aws_secret_access_key=test&region_name=us-east-1&table=demo" \
    "crate://crate:crate@localhost:4200/testdrive/kinesis"

Documentation
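Once a load has finished, one way to sanity-check the result is to count the rows in the target table through CrateDB's HTTP endpoint. The sketch below assumes the connection details from the synopsis above (localhost:4200, user crate, password crate) and that the target path testdrive/kinesis maps to schema testdrive and table kinesis. The helper names `count_payload` and `count_rows` are made up for illustration.

```shell
# Assumed connection details, taken from the target URL in the synopsis.
CRATEDB_SQL_URL="http://localhost:4200/_sql"
CRATEDB_AUTH="crate:crate"

# Hypothetical helper: build the JSON request body for a row count.
# $1 = schema name, $2 = table name (plain identifiers without quotes)
count_payload() {
    printf '{"stmt": "SELECT COUNT(*) FROM %s.%s"}' "$1" "$2"
}

# Hypothetical helper: send the statement to CrateDB's HTTP endpoint.
count_rows() {
    curl --silent --show-error \
        --user "$CRATEDB_AUTH" \
        --header "Content-Type: application/json" \
        --data "$(count_payload "$1" "$2")" \
        "$CRATEDB_SQL_URL"
}

# Show the request body; this does not need a running server.
count_payload testdrive kinesis
echo

# With CrateDB running, uncomment to execute the query:
# count_rows testdrive kinesis
```

Freshly written rows may take a moment to become visible to queries, so a count taken immediately after the load can lag behind. Running REFRESH TABLE on the target table first should avoid that.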

Loading data from Apache Kafka

Synopsis

ctk load \
    "kafka:?bootstrap_servers=localhost:9092&group_id=test&table=demo" \
    "crate://crate:crate@localhost:4200/testdrive/kafka"

Documentation

Loading data from Databricks SQL warehouses

Synopsis

ctk load \
    "databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples&table=accuweather.forecast_hourly_metric" \
    "crate://crate:crate@localhost:4200/testdrive/accuweather_forecast_hourly_metric"

Documentation

Loading data from SAP HANA

Synopsis

ctk load \
    "hana://SYSTEM:HXEHana1@localhost:39017/SYSTEMDB?table=sys.adapters" \
    "crate://crate:crate@localhost:4200/testdrive/hana_sys_adapters"

Documentation

Apache Iceberg and Delta Lake (load and save)

Hi again. We recently added I/O adapters for Apache Iceberg tables [1] and Delta Lake tables [2], in line with our aim to improve interoperability with open table formats.

Both are open table formats that build upon Apache Parquet data files. Parquet is a free and open-source column-oriented data storage format, and the table formats on top of it effectively succeed and supersede Apache Hive use cases from the Hadoop era.

CrateDB Toolkit now provides adapters to import data from and export data to those open table formats. Please let us know if you discover any flaws, and don’t hesitate to share any ideas for improvement. Thank you in advance. 🙏

Synopsis

uv tool install --upgrade 'cratedb-toolkit[iceberg,deltalake]'
ctk load \
    "s3+iceberg://bucket1/demo/taxi-tiny/metadata/00003-dd9223cb-6d11-474b-8d09-3182d45862f4.metadata.json?s3.access-key-id=<your_access_key_id>&s3.secret-access-key=<your_secret_access_key>&s3.endpoint=<endpoint_url>&s3.region=<s3-region>" \
    "crate://crate:crate@localhost:4200/demo/taxi-tiny"
ctk load \
    "s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=<your_access_key_id>&AWS_SECRET_ACCESS_KEY=<your_secret_access_key>&AWS_ENDPOINT=<endpoint_url>&AWS_REGION=<s3-region>" \
    "crate://crate:crate@localhost:4200/demo/taxi-tiny"

Documentation
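The source URLs above carry the S3 credentials as query parameters. To avoid typing secrets inline, the URLs can be assembled from shell variables. This is a minimal sketch: the helper names `iceberg_source_url` and `deltalake_source_url` are made up for illustration, and using the shell variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT, and AWS_REGION as inputs is an assumption of this example. Values are inserted verbatim, so characters such as & or = in a value would need percent-encoding first.

```shell
# Hypothetical helper: assemble an Iceberg source URL.
# $1 = bucket and path of the table's metadata JSON file
iceberg_source_url() {
    printf 's3+iceberg://%s?s3.access-key-id=%s&s3.secret-access-key=%s&s3.endpoint=%s&s3.region=%s' \
        "$1" "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" "$AWS_ENDPOINT" "$AWS_REGION"
}

# Hypothetical helper: assemble a Delta Lake source URL.
# $1 = bucket and path of the Delta Lake table
deltalake_source_url() {
    printf 's3+deltalake://%s?AWS_ACCESS_KEY_ID=%s&AWS_SECRET_ACCESS_KEY=%s&AWS_ENDPOINT=%s&AWS_REGION=%s' \
        "$1" "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" "$AWS_ENDPOINT" "$AWS_REGION"
}

# Usage, with the four variables set in the current shell:
#
# ctk load \
#     "$(deltalake_source_url "bucket1/demo/taxi-tiny")" \
#     "crate://crate:crate@localhost:4200/demo/taxi-tiny"
```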


  1. Iceberg is a specification and high-performance format for huge analytic tables, making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time. Apache Iceberg is its reference implementation.

  2. Delta Lake (paper) is the optimized storage layer that provides the foundation and default format for all table operations on Databricks. It was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale.

Loading data from Elasticsearch

Synopsis

ctk load \
    "elasticsearch://localhost:9200?secure=false&table=taxi_details" \
    "crate://crate:na@localhost:4200/testdrive/taxi_details"

Documentation