After working hard on developing a machine learning model, you know there is still one seemingly small step left: moving it to production.

In a common scenario, what you would probably like is a workflow that automates:

  • gathering and preprocessing the data
  • running inference on it
  • storing the predictions

Ideally you would want a tool that can help you:

  • deal with big data
  • guarantee robustness and resilience
  • execute your workflows on a schedule or when certain preconditions are met
  • resolve dependencies between tasks

If until now you have been using cron to schedule jobs, this could be the right time to adopt a well-established tool like Apache Airflow to address this complexity.

Apache Airflow is an open source project written in Python for programmatically authoring, scheduling and monitoring batch execution of tasks.

You can design your pipelines according to whatever logic you need: decide which actions to perform, retry them if errors occur, skip tasks if their dependencies are not met, monitor status and inspect logs through a friendly and powerful web UI, and a lot more.

A very nice feature of Airflow is that all of the above is configured and defined in Python code. Airflow pipelines can therefore benefit from standard software development practices such as peer review, automated testing and version control.
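As a rough sketch (not the workshop's material), and assuming Airflow 2.x with its PythonOperator, the inference workflow outlined above might be expressed as a DAG along these lines; the DAG id, task ids and function bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_and_preprocess():
    # Placeholder: gather raw data and apply the same preprocessing used in training.
    ...


def run_inference():
    # Placeholder: load the trained model and score the preprocessed batch.
    ...


def store_predictions():
    # Placeholder: write predictions to a database, bucket or downstream service.
    ...


with DAG(
    dag_id="ml_inference_pipeline",   # hypothetical name
    schedule_interval="@daily",       # run once a day; cron expressions also work
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    preprocess = PythonOperator(
        task_id="fetch_and_preprocess", python_callable=fetch_and_preprocess
    )
    predict = PythonOperator(
        task_id="run_inference", python_callable=run_inference
    )
    store = PythonOperator(
        task_id="store_predictions", python_callable=store_predictions
    )

    # Airflow runs a task only after its upstream dependencies have succeeded.
    preprocess >> predict >> store
```

Dropping a file like this into the DAGs folder is enough for the scheduler to pick it up, run it daily, retry failed tasks according to their settings, and show each run in the web UI.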

In this workshop we’ll go over basic Airflow concepts and set up an instance for orchestrating an inference pipeline for a machine learning model.

Enrica Pasqua

Affiliation: Delivery Hero SE

Enrica works in Berlin as a Senior Data Engineer at Delivery Hero, where she develops and maintains large scale data pipelines using Python. Her interests include Big Data Architecture, Process Automation and Machine Learning.


Bahadir Uyarer

Affiliation: Delivery Hero SE