OpenLineage with Airflow, Marquez, and CrateDB

About

OpenLineage

OpenLineage is an open-source, industry-standard framework for data lineage. It standardizes the definition of data lineage, the metadata that makes up lineage data, and the approach for collecting lineage data from external systems.

OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs.

Marquez

Marquez is the reference implementation of an OpenLineage lineage repository.

Setup

On a fresh Ubuntu machine (22.10), let’s first start Marquez:

sudo apt install docker-compose
git clone https://github.com/MarquezProject/marquez && cd marquez
sudo ./docker/up.sh

Install the Astro CLI:

sudo apt install curl
curl -sSL install.astronomer.io | sudo bash -s

Let’s initialize a project folder:

mkdir datalineageeval
cd datalineageeval
astro dev init

Edit requirements.txt, which is initially empty, and add this line:

apache-airflow-providers-postgres

Create an .env file:

OPENLINEAGE_URL=http://172.17.0.1:5000
OPENLINEAGE_NAMESPACE=example
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
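
Before starting Airflow, it can help to confirm that the URL in this file really points at Marquez. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and it assumes Marquez exposes its namespace list under /api/v1/namespaces and answers with a JSON object holding a "namespaces" list. Only the URL is built here; fetch_namespace_names is defined but not called, since it needs the Marquez containers to be running.

```python
import json
import urllib.request


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def namespaces_url(env):
    """Build the Marquez endpoint that lists namespaces (assumed path)."""
    return env["OPENLINEAGE_URL"].rstrip("/") + "/api/v1/namespaces"


def fetch_namespace_names(url, timeout=5):
    """Query Marquez for its namespaces; requires Marquez to be up."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the payload looks like {"namespaces": [{"name": ...}, ...]}.
    return [ns["name"] for ns in payload.get("namespaces", [])]


# Same content as the .env file above.
ENV_TEXT = """\
OPENLINEAGE_URL=http://172.17.0.1:5000
OPENLINEAGE_NAMESPACE=example
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
"""

env = parse_env(ENV_TEXT)
print(namespaces_url(env))
```

Once a DAG has run, calling fetch_namespace_names(namespaces_url(env)) should list the namespace set in OPENLINEAGE_NAMESPACE, if the assumptions above hold.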

Edit .astro/config.yaml, adding these lines at the end:

postgres:
  port: 5435

Operations

Then, start Astronomer:

sudo astro dev start

Port 5432 is already used by Marquez’s own PostgreSQL instance and port
5435 by Astro’s, so we publish CrateDB’s PostgreSQL port on 5436:

sudo docker run --publish=4200:4200 --publish=5436:5432 --env CRATE_HEAP_SIZE=1g crate:latest -Cdiscovery.type=single-node
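
To check that CrateDB is up, one option is its HTTP endpoint, which the command above publishes on port 4200. The sketch below is an illustration only: it assumes CrateDB is reachable at localhost:4200, and the helper names build_sql_request and run_sql are hypothetical. It builds a POST request for the /_sql endpoint; run_sql is defined but not called, as it needs the container to be running.

```python
import json
import urllib.request


def build_sql_request(stmt, base_url="http://localhost:4200"):
    """Build a POST request for CrateDB's HTTP /_sql endpoint."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(stmt, base_url="http://localhost:4200", timeout=5):
    """Send the statement and return the decoded JSON response."""
    request = build_sql_request(stmt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Build (but do not send) a simple health-check query.
request = build_sql_request("SELECT name FROM sys.cluster")
print(request.full_url)
```

With the container running, run_sql("SELECT name FROM sys.cluster") should return a JSON document with the cluster name in its rows.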

Usage

Then, continue from step 2 of the “Generate and view lineage data”
section of the Astronomer documentation page “Integrate OpenLineage and Airflow with Marquez”.

In step 2:

  • use TIMESTAMP instead of DATE for the columns

  • prefix all references to table names with a schema of your choice
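
The two adjustments above can be sketched as a small text rewrite. This is only an illustration under stated assumptions: the schema name lineage_demo and the helper adapt_for_cratedb are hypothetical, the sample statement is illustrative rather than copied from the tutorial, and the type replacement assumes column types are written in upper case so that a column named date is left alone.

```python
import re

SCHEMA = "lineage_demo"  # hypothetical schema name; pick your own


def adapt_for_cratedb(sql, tables, schema=SCHEMA):
    """Swap DATE for TIMESTAMP and schema-qualify the given table names."""
    # Case-sensitive on purpose: matches the type DATE, not a column named date.
    sql = re.sub(r"\bDATE\b", "TIMESTAMP", sql)
    for table in tables:
        # Skip names that are already qualified (preceded by a dot) or that
        # are only the tail of a longer identifier.
        pattern = r"(?<![\w.])" + re.escape(table) + r"\b"
        sql = re.sub(pattern, f"{schema}.{table}", sql)
    return sql


# Illustrative statement; the table name is one mentioned later in this post.
original = (
    "CREATE TABLE IF NOT EXISTS animal_adoptions_combined "
    "(date DATE, type VARCHAR, name VARCHAR, age INTEGER);"
)
adapted = adapt_for_cratedb(original, ["animal_adoptions_combined"])
print(adapted)
```

Running the function on its own output leaves the statement unchanged, so it is safe to apply to files that have already been edited by hand.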

In step 3:

  • prefix all references to table names with the same schema used earlier,
    as otherwise Astro looks for the tables’ metadata in the public schema
    (already raised with Astronomer)

  • use host 172.17.0.1, port 5436, and user crate for the
    connection. Do not specify a “schema”: that field
    actually refers to a PostgreSQL database name
    (also raised with Astronomer)

  • use TIMESTAMP instead of DATE for animal_adoptions_combined
    and adoption_reporting_long

  • copy the code for the DAGs in the tutorial as .py files under
    datalineageeval/dags

  • to see the jobs in Marquez, select the “example” namespace
    instead of “default” in the upper right corner
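
As an alternative to entering the connection by hand, Airflow can also read connections from environment variables named AIRFLOW_CONN_<CONN_ID> that hold a connection URI. The sketch below builds such a line for the values listed above. It is an assumption that the tutorial's DAGs use the connection id postgres_default; adjust it to whatever id the DAG code references. The helper names are hypothetical, and the path after the final slash is left empty because no “schema” (database name) should be set.

```python
from urllib.parse import quote, urlsplit


def connection_uri(user, host, port, database=""):
    """Build a postgres:// connection URI; database stays empty by default."""
    return f"postgres://{quote(user, safe='')}@{host}:{port}/{quote(database, safe='')}"


def env_line(conn_id, uri):
    """Format the line Airflow reads as a connection definition."""
    return f"AIRFLOW_CONN_{conn_id.upper()}={uri}"


# Values from the list above; the connection id is an assumption.
uri = connection_uri("crate", "172.17.0.1", 5436)
line = env_line("postgres_default", uri)
print(line)
```

The printed line could be appended to the .env file created earlier, followed by a restart of the Astro project so that the containers pick it up.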
