Introduction
Dbt [1] is a tool for transforming data in data warehouses using Python and SQL.
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
The idea is that Data Engineers make source data available to an environment where dbt projects run, for example with Debezium or with Airflow. Afterwards, Data Analysts can run their dbt projects against this data to produce models (tables and views) that can be used with a number of BI tools.
dbt’s Features
The data abstraction layer provided by dbt allows the decoupling of the models on which reports and dashboards rely from the source data. When business rules or source systems change, you can still maintain the same models as a stable interface.
Some of the things that dbt can do include:
- Import reference data from CSV files
- Track changes in source data with different strategies so that downstream models do not need to be built every time from scratch
- Run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models’ logic
dbt and CrateDB
Due to its unique capabilities, CrateDB is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally.
In this article, I will illustrate how to get the most important functionalities of dbt working by doing the necessary changes in the configuration.
Getting started
For running the following steps, you will need connectivity to a CrateDB cluster, and a Python installation on your workstation. The starting point will be a fresh install of dbt-postgres
.
pip install dbt-postgres==1.6.0
Now, create a profiles.yaml
file with your connection details.
cd ~
mkdir .dbt
cat << EOF > .dbt/profiles.yml
example_datawarehouse_profile:
target: dev
outputs:
dev:
type: postgres
host: localhost
port: 5432
database: crate
schema: doc
search_path: doc
user: dbt
password: pwd1234567A
EOF
(please note the values for database
, schema
, and search_path
in this example)
A dbt project has a specific structure [2], and contains a combination of SQL, Jinja, YAML, and Markdown files. In your project folder, alongside the models
folder that most projects have, create a folder called macros
, and populate it with these macro override files:
- overrides.sql (3.9 KB)
- catalog.sql (2.0 KB)
- relations.sql (1.1 KB)
Those dbt features have been tested successfully:
- models with view, table, and ephemeral materializations
- dbt source freshness
- dbt test
- dbt seed
- Incremental materializations (with
incremental_strategy='delete+insert'
and without involving OBJECT columns)
We hope you find this useful. CrateDB is continuously adding new features and we will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.