Loading data from Databricks SQL warehouses
Prerequisites
You will need an account on https://databricks.com/.
Install
Install the most recent Python packages crash and ingestr.
uv tool install --upgrade crash ingestr
Tutorial
Databricks authentication information
For accessing Databricks SQL warehouses using ingestr, you need an access token, your server hostname, and the HTTP path (endpoint of the warehouse).
On the Databricks platform, in your settings dialog (top right corner), navigate to “User » Developer » Access tokens » Manage”, and select “Generate new token”. Then, copy the generated token as suggested on the result dialog (example: dapi367e0b1....), and use it in your ingestr command below.
– https://<instance>.cloud.databricks.com/settings/user/developer
Within the primary navigation, navigate to “SQL » SQL Warehouses”, select the “Serverless Starter Warehouse” or any other, navigate to “Connection Details”, and use the values in “Server hostname” and “HTTP path” in your ingestr command below.
– https://<instance>.cloud.databricks.com/sql/warehouses/<warehouse>/connectionDetails
Services
Run CrateDB using Docker or Podman.
docker run --rm --name=cratedb \
--publish=4200:4200 --publish=5432:5432 --env=CRATE_HEAP_SIZE=2g \
docker.io/crate:latest '-Cdiscovery.type=single-node'
Load data
The Databricks starter warehouse includes a few example tables. Let’s tap into and transfer two of them, then inspect their schema and data in CrateDB.
Transfer data
samples.nyctaxi.trips
Transfer data:
ingestr ingest --yes \
--source-uri 'databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples' \
--source-table 'nyctaxi.trips' \
--dest-uri 'cratedb://crate:crate@localhost:5432?sslmode=disable' \
--dest-table 'databricks.nyctaxi_trips'
Query data:
crash -c 'SHOW CREATE TABLE databricks.nyctaxi_trips'
crash -c 'SELECT * FROM databricks.nyctaxi_trips LIMIT 5'
samples.accuweather.forecast_hourly_metric
Transfer data:
ingestr ingest --yes \
--source-uri 'databricks://token:<access_token>@<instance>.cloud.databricks.com:443/?http_path=/sql/1.0/warehouses/<warehouse>&catalog=samples' \
--source-table 'accuweather.forecast_hourly_metric' \
--dest-uri 'cratedb://crate:crate@localhost:5432?sslmode=disable' \
--dest-table 'databricks.accuweather_forecast_hourly_metric'
Query data:
crash -c 'SHOW CREATE TABLE databricks.accuweather_forecast_hourly_metric'
crash -c 'SELECT * FROM databricks.accuweather_forecast_hourly_metric LIMIT 5'
Documentation
Other options
You can use ingestr to import data from a variety of other data sources. CrateDB also provides integrations for many other ETL applications and frameworks.