About
OpenLineage
OpenLineage is the open source industry standard framework for data lineage. It standardizes the definition of data lineage, the metadata that makes up lineage data, and the approach for collecting lineage data from external systems.
OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs.
Marquez
Marquez is OpenLineage’s lineage repository reference implementation.
Setup
On a fresh Ubuntu machine (22.10), let’s first start Marquez:
sudo apt install docker-compose
git clone https://github.com/MarquezProject/marquez && cd marquez
sudo ./docker/up.sh
Install the Astro CLI:
sudo apt install curl
curl -sSL install.astronomer.io | sudo bash -s
Let’s initialize a project folder:
mkdir datalineageeval
cd datalineageeval
astro dev init
Edit requirements.txt, which is empty, and add a line like this:
apache-airflow-providers-postgres
Create an .env file:
OPENLINEAGE_URL=http://172.17.0.1:5000
OPENLINEAGE_NAMESPACE=example
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
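To make explicit what these three settings point at, here is a minimal sketch in Python that parses the same .env content and derives the Marquez endpoints involved. It assumes that lineage events are posted to the api/v1/lineage path below OPENLINEAGE_URL and that jobs are listed per namespace under api/v1/namespaces; the helper names parse_env and marquez_urls are made up for this illustration.

```python
from urllib.parse import quote, urljoin

# Same content as the .env file above.
ENV_TEXT = """\
OPENLINEAGE_URL=http://172.17.0.1:5000
OPENLINEAGE_NAMESPACE=example
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
"""


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def marquez_urls(settings: dict) -> dict:
    """Derive the Marquez URLs implied by the settings (paths are assumptions)."""
    base = settings["OPENLINEAGE_URL"].rstrip("/") + "/"
    namespace = quote(settings["OPENLINEAGE_NAMESPACE"], safe="")
    return {
        # Where lineage events are sent.
        "lineage": urljoin(base, "api/v1/lineage"),
        # Where the jobs of our namespace can be listed.
        "jobs": urljoin(base, f"api/v1/namespaces/{namespace}/jobs"),
    }


settings = parse_env(ENV_TEXT)
urls = marquez_urls(settings)
for name, url in urls.items():
    print(f"{name}: {url}")
```

Once some DAGs have run, opening the jobs URL in a browser or with curl is a quick way to confirm that events arrive in the “example” namespace.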
Edit .astro/config.yaml, adding these lines at the end:
postgres:
  port: 5435
Operations
Then, start Astronomer:
sudo astro dev start
As port 5432 is used by Marquez’s own PostgreSQL instance, and port
5435 is used by Astro’s, we will start CrateDB on 5436.
sudo docker run --publish=4200:4200 --publish=5436:5432 --env CRATE_HEAP_SIZE=1g crate:latest -Cdiscovery.type=single-node
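With three PostgreSQL-compatible endpoints on one machine, it is easy to lose track of which port belongs to which service. The following is a small sketch, using only the Python standard library, that checks whether each port used in this walkthrough accepts TCP connections. The port numbers are the ones from this post; probing 127.0.0.1 and the names SERVICES, port_is_open, and report are assumptions made for this illustration.

```python
import socket

# Ports as used in this walkthrough.
SERVICES = {
    "Marquez API": 5000,
    "Marquez PostgreSQL": 5432,
    "Astro PostgreSQL": 5435,
    "CrateDB PostgreSQL wire protocol": 5436,
    "CrateDB HTTP": 4200,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host: str = "127.0.0.1") -> dict:
    """Map each service name to whether its port is reachable on host."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Calling report() after all three stacks are up should show every service as reachable; a False entry points at the container to inspect first. Note that this only tests that something is listening, not that it is the expected service.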
Usage
Then, continue from step 2 of the “Generate and view lineage data” section of the Astronomer documentation page “Integrate OpenLineage and Airflow with Marquez”.
In step 2:
- Use TIMESTAMP instead of DATE for the columns.
- Prefix all references to table names with a schema of your choice.
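The two adjustments for step 2 are mechanical, so they can be sketched as a small Python helper that rewrites a SQL statement. This is only an illustration: the CREATE TABLE statement below is a stand-in and not copied from the tutorial, the schema name lineage is an arbitrary choice, and the helper assumes that type keywords are written in upper case while column names are in lower case.

```python
import re


def adapt_sql(sql: str, schema: str, tables: list) -> str:
    """Swap the DATE type for TIMESTAMP and schema-qualify the given tables."""
    # Case-sensitive on purpose, so a lower-case column named "date" is kept.
    sql = re.sub(r"\bDATE\b", "TIMESTAMP", sql)
    for table in tables:
        # The lookbehind skips names that are already qualified
        # or that are only the tail of a longer identifier.
        pattern = rf"(?<![\w.]){re.escape(table)}\b"
        sql = re.sub(pattern, f"{schema}.{table}", sql)
    return sql


# Illustrative statement, not taken verbatim from the tutorial.
original = (
    "CREATE TABLE IF NOT EXISTS animal_adoptions_combined "
    "(date DATE, type VARCHAR, name VARCHAR, age INTEGER);"
)
adapted = adapt_sql(original, "lineage", ["animal_adoptions_combined"])
print(adapted)
```

Running the helper twice on the same statement leaves it unchanged, so it is safe to apply to files that were already partly edited by hand.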
In step 3:
- Prefix all references to table names with the same schema used earlier, as otherwise Astro looks for the tables’ metadata in the public schema (already raised with Astronomer).
- Use 172.17.0.1, port 5436, and user crate for the connection. Do not specify a “schema”, as this field actually refers to a PostgreSQL database name (also raised with Astronomer).
- Use TIMESTAMP instead of DATE for animal_adoptions_combined and adoption_reporting_long.
- Copy the code for the DAGs in the tutorial as .py files under datalineageeval/dags.
- To see the jobs in Marquez, select “example” instead of “default” in the upper right corner.
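As an alternative to entering the connection details from step 3 in the Airflow UI, Airflow can also read connections from environment variables named AIRFLOW_CONN_<CONN_ID> that hold a connection URI. The sketch below builds such a line for the .env file. The connection ID postgres_default is an assumption, so use whichever ID the tutorial’s DAGs reference; the helper name airflow_conn_env is made up for this illustration.

```python
from urllib.parse import quote


def airflow_conn_env(conn_id: str, host: str, port: int, user: str, password: str = ""):
    """Return the (variable name, URI) pair for a PostgreSQL-wire connection."""
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    # No path component on purpose: the "schema" (database) field stays empty.
    uri = f"postgres://{credentials}@{host}:{port}"
    return f"AIRFLOW_CONN_{conn_id.upper()}", uri


# Values from this walkthrough; the connection ID is an assumption.
name, uri = airflow_conn_env("postgres_default", "172.17.0.1", 5436, "crate")
print(f"{name}={uri}")
```

The printed line can be appended to the .env file created earlier, followed by a restart of the Astro project so that the new variable is picked up.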