Quickstart

Prerequisites

Although the plugin does not perform deployment, it’s recommended to have access to the Airflow DAG directory in order to test-run the generated DAG.
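
For example, if the DAG directory is mounted locally at a path referred to here as ${AIRFLOW_DAG_FOLDER} (a placeholder, adjust it to your deployment), you can verify access with:

$ ls ${AIRFLOW_DAG_FOLDER}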

Install the toy project with Kedro Airflow K8S support

It is a good practice to start by creating a new virtualenv before installing new packages. Therefore, use the virtualenv command to create a new environment and activate it:

$ virtualenv venv-demo
created virtual environment CPython3.8.5.final.0-64 in 145ms
  creator CPython3Posix(dest=/home/mario/kedro/venv-demo, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/mario/.local/share/virtualenv)
    added seed packages: pip==20.3.1, setuptools==51.0.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
$ source venv-demo/bin/activate

Then, kedro must be present to enable cloning the starter project, along with the latest versions of the kedro-airflow-k8s plugin and kedro-docker.

$ pip install 'kedro<0.17' kedro-airflow-k8s kedro-docker

With the dependencies in place, let’s create a new project:

$ kedro new --starter=git+https://github.com/getindata/kedro-starter-spaceflights.git --checkout allow_nodes_with_commas
Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
 [New Kedro Project]: Airflow K8S Plugin Demo

Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
 [airflow-k8s-plugin-demo]: 

Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
 [airflow_k8s_plugin_demo]: 

Change directory to the project generated in ${CWD}/airflow-k8s-plugin-demo

A best-practice setup includes initialising git and creating a virtual environment before running `kedro install` to install project-specific dependencies. Refer to the Kedro documentation: https://kedro.readthedocs.io/

TODO: switch to the official spaceflights starter after https://github.com/quantumblacklabs/kedro-starter-spaceflights/pull/10 is merged

Finally, go to the demo project directory and ensure that the kedro-airflow-k8s plugin is activated:

$ cd airflow-k8s-plugin-demo/
$ kedro install
(...)
Requirements installed!
$ kedro airflow-k8s --help

Usage: kedro airflow-k8s [OPTIONS] COMMAND [ARGS]...

Options:
  -e, --env TEXT       Environment to use.
  -p, --pipeline TEXT  Pipeline name to pick.
  -h, --help           Show this message and exit.

Commands:
  compile          Create an Airflow DAG for a project
  init             Initializes configuration for the plugin
  list-pipelines   List pipelines generated by this plugin
  run-once         Uploads pipeline to Airflow and runs once
  schedule         Uploads pipeline to Airflow with given schedule
  ui               Open Apache Airflow UI in new browser tab
  upload-pipeline  Uploads pipeline to Airflow DAG location

Build the docker image to be used on Airflow K8S runs

First, initialize the project with kedro-docker configuration by running:

$ kedro docker init

This command creates several files, including .dockerignore. This file ensures that transient files are not included in the docker image, and it requires a small adjustment. Open it in your favourite text editor and extend the section # except the following by adding this line:

!data/01_raw
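
After this change, the relevant part of .dockerignore should look similar to the following (the other generated entries may vary between kedro-docker versions):

# except the following
!data/01_raw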

This change ensures that the raw data is included in the image. Also, one of the limitations of running a Kedro pipeline on Airflow (as opposed to a local environment) is the inability to use MemoryDataSets, as the pipeline nodes do not share memory, so every artifact should be stored as a file. The spaceflights demo configures four datasets as in-memory, so let’s change this behaviour by adding these lines to conf/base/catalog.yml:

X_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_train.pickle
  layer: model_input

y_train:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_train.pickle
  layer: model_input

X_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/X_test.pickle
  layer: model_input

y_test:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/y_test.pickle
  layer: model_input

Finally, build the image:

kedro docker build

When execution finishes, your docker image is ready. If you don’t use a local cluster, you should push the image to a remote repository:

docker tag airflow_k8s_plugin_demo:latest remote.repo.url.com/airflow_k8s_plugin_demo:latest
docker push remote.repo.url.com/airflow_k8s_plugin_demo:latest

Set up a git repository

The plugin requires the project to be stored in a git repository. Initialize the repository and commit the project files.
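
A minimal sequence from the project root might look like this (adjust the commit message to your needs):

$ git init
$ git add .
$ git commit -m "Initial commit"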

Compile DAG

The plugin requires its configuration to be present. It’s best to create it with:

kedro airflow-k8s init --with-github-actions --output ${AIRFLOW_DAG_FOLDER} https://airflow.url

This command creates a configuration file in conf/pipelines/airflow-k8s.yaml with some custom values and a reference to the Airflow instance passed in the arguments. It also creates some default GitHub Actions workflows.

When using this command, note that the configuration expects commit_id and google_project_id to be present. Provide them by setting the environment variables KEDRO_CONFIG_COMMIT_ID and KEDRO_CONFIG_GOOGLE_PROJECT_ID.
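
For example, a sketch assuming the commit id should track the current git revision and using a hypothetical GCP project id (replace it with your own):

$ export KEDRO_CONFIG_COMMIT_ID=$(git rev-parse --short HEAD)
$ export KEDRO_CONFIG_GOOGLE_PROJECT_ID=my-gcp-project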

The mlflow configuration also has to be set up (if required by the project), as described in the mlflow section.

Having configuration ready, type:

kedro airflow-k8s -e pipelines compile

This command compiles the pipeline and generates the DAG in dag/airflow_k8s_plugin_demo.py. This file should be copied manually into the Airflow DAG directory, which Airflow scans periodically. After it appears in the Airflow console, it is ready to be triggered.
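
For example, assuming the Airflow DAG directory is mounted locally at a path referred to here as ${AIRFLOW_DAG_FOLDER}:

$ cp dag/airflow_k8s_plugin_demo.py ${AIRFLOW_DAG_FOLDER}/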

As an alternative, one can use the following:

kedro airflow-k8s -e pipelines upload-pipeline -o ${AIRFLOW_DAG_HOME}

in order to get the DAG copied directly to the Airflow DAG folder. Google Cloud Storage locations are also supported with the gcs:// or gs:// prefix in the parameter (this requires the plugin to be installed with pip install kedro-airflow-k8s[gcp]).

In order to use AWS S3 as storage, prefix the output with s3:// (this requires the plugin to be installed with pip install kedro-airflow-k8s[aws]).
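
For example, with a hypothetical GCS bucket as the target (the same pattern applies to s3:// paths with the aws extra):

$ pip install 'kedro-airflow-k8s[gcp]'
$ kedro airflow-k8s -e pipelines upload-pipeline -o gs://my-airflow-bucket/dags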

Indicating which pipeline to pick with the -p option is optional. By default, the pipeline named __default__ is used. The -p option can refer to any other pipeline by the name under which it is registered inside the kedro hook.
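
For example, to compile a hypothetical pipeline registered under the name data_processing instead of the default one:

$ kedro airflow-k8s -e pipelines -p data_processing compile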

Diagnose execution

Every kedro node is transformed into an Airflow DAG task. The DAG also contains other, supporting tasks, which are handled by a set of custom operators. To help diagnose a DAG run, every task logs information with the standard Python logging library. The output is available in the Airflow Log tab.