Magic Loop in Airflow — reloaded

Jarek Potiuk
Published in Apache Airflow
Jul 23, 2022


Quite recently, we published the Magic Loop blog post by Itay Bittan in the Apache Airflow publication. Itay was inspired by our discussion in the Airflow Slack about how he could reduce the task execution delay caused by dynamic DAG parsing.

This is a follow-up, and well, even more magic is involved (or in fact complexity, but more about that later). It describes how "simple magic" can become even more magical, until finally it gets so magical that you decide to replace the "magic" with a "robust product", because the magic is a bit too, well, magical.

And it comes with a "Community over Code" twist, so if you are interested in how a true Open Source community works, read on.

Image credit: https://freesvg.org/benbois-magic-ball

The story begins

It all started with a Slack discussion, "I see delays in our task execution", raised by Itay Bittan, with me trying to understand the problem and help. We made a few observations during the discussion:

  • they create many DAGs (1000s) in a loop in a single DAG file (see the sketch right after this list)
  • when any task is executed, they experience a ~2 minute delay caused by parsing all the DAGs (because every task execution effectively has to FULLY PARSE the DAG file it comes from)
  • each of the DAGs is independent of the others, and creating them has no side effects
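To make the setup concrete, here is a minimal sketch of what such a file can look like (the DAG ids, schedule and operator are made up for illustration, not taken from Itay's actual code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A single Python file that creates thousands of DAG objects in a loop.
# Every task execution re-parses this file, re-creating ALL of them.
for i in range(1000):
    dag_id = f"generated_dag_{i}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        EmptyOperator(task_id="do_something")
    # Register the DAG in the module's globals so Airflow discovers it.
    globals()[dag_id] = dag
```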

I suggested implementing a "magic" loop: skipping the creation of all the other DAGs when a task is being executed (unlike when the DAG processor parses the file for scheduling). And BTW, the "magic loop" name was invented by Itay.

The solution

Not much time passed before Itay implemented it and nicely described how it worked for them: they got a whopping 120 ms instead of a 2 minute delay when executing the DAG.

That sounded like a FANTASTIC return on a simple solution. And my immediate thought was: yeah, there are plenty of users out there who must have a similar problem; let's help them by making it a "1st class citizen" in Airflow. I wrote a devlist post about it, after some time got some responses, and finally decided to add guidelines about it in the official Airflow documentation. But since this is the "product" documentation, I wanted to make it "product" quality.

Now, if you are familiar with the "solution" vs. "product" difference, what happened next exactly reflected this difference. Over the years, I've learned that it is a very, very difficult journey to start with a simple solution, turn it into a reusable one, and finally turn it into a product. The rule of thumb I've learned is:

  • solution to a specific problem = costs you x
  • making it reusable = costs you 3x
  • turning it into a product = 3x making it reusable = 9x the original cost

And boy … how prophetic and how UNDERESTIMATED it turned out to be this time.

Making it reusable

When I worked on the documentation Pull Request about it, I not only had to describe it, but also review and test the code, and what I came up with was not what I initially expected.

Initially the code of Itay was — essentially — this:

Original code solving the problem
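The gist embedded in the original post is not reproduced here; based on Itay's description, a minimal sketch of that original hack (assuming the Kubernetes Executor, where each task starts a fresh interpreter with a command line like `airflow tasks run <dag_id> <task_id> …`) goes roughly like this:

```python
import sys
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# When a task runs, the worker executes "airflow tasks run <dag_id> <task_id> ...",
# so the DAG currently being executed can be recovered from the command line.
current_dag_id = None
if len(sys.argv) > 3 and sys.argv[1] == "tasks" and sys.argv[2] == "run":
    current_dag_id = sys.argv[3]

for i in range(1000):
    dag_id = f"generated_dag_{i}"
    # The "magic loop": while executing a task, skip creating every DAG
    # except the one the task belongs to.
    if current_dag_id is not None and current_dag_id != dag_id:
        continue
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        EmptyOperator(task_id="do_something")
    globals()[dag_id] = dag
```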

Looks simple enough, right? I thought: yeah, I will just have to add a little robustness, and it should be easy; let me just make sure that all the configurations are well covered. Some of my fellow Airflow committers (like Felix Uellendal) encouraged me to continue when I posted the first version, but then the vigilance of others kicked in; in this case it was Bas Harenslak who asked the question: "are all cases covered here?".

I eagerly attempted to review and check the Airflow code, and … a few hours later, what was literally a few lines of code described by Itay in his post turned out to be, unsurprisingly, way more than 3x the amount of code needed (but it was well tested and quite robust, I thought).

The original code only worked for the very configuration they had at Itay's company. They used the Kubernetes Executor, and hacking it to retrieve the current dag and task was simple. However, Airflow is more than that. Airflow can run with the Kubernetes Executor, Celery Executor, Local Executor, or CeleryKubernetesExecutor; the last three can run tasks by starting a new interpreter or by forking, and there can also be custom executors. After looking at the code, digging in and running some tests (and a few iterations of those), I came up with this:

Reusable solution of the problem.
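The actual code in the PR was considerably longer; a simplified, hypothetical sketch of the approach it took (the exact process title format Airflow sets for forked task runners is an assumption here) could look like this:

```python
import re
import sys
from typing import Optional

try:
    from setproctitle import getproctitle
except ImportError:
    getproctitle = None


def get_current_dag_id() -> Optional[str]:
    """Best-effort detection of the DAG currently being executed.

    Returns None when the file is parsed for scheduling, so all DAGs
    are created (the safe fall-back).
    """
    try:
        # Case 1: the task runs in a fresh interpreter (e.g. Kubernetes
        # Executor), so sys.argv is "airflow tasks run <dag_id> ...".
        if len(sys.argv) > 3 and sys.argv[1] == "tasks" and sys.argv[2] == "run":
            return sys.argv[3]
        # Case 2: the task runs in a forked process, where sys.argv is not
        # reliable; fall back to the process title Airflow sets for the
        # task runner (the title pattern below is a made-up illustration).
        if getproctitle is not None:
            match = re.match(r"airflow task (?:runner|supervisor): (\S+)", getproctitle())
            if match:
                return match.group(1)
    except Exception:
        # If anything goes wrong, parse all DAGs rather than break parsing.
        pass
    return None
```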

Hmm… it's even more than 3x as complex, and in parts it is based on reading the current process title (the only way I could find to reliably read the information in the "fork" case). And it was not perfect either. There are at least a few (not very likely) edge cases I could see that would break it (and fall back to the original "all DAGs" parsing).

But, well, there is also a chance that some of this will actually fail (hence the try/except Exception around it).

I proceeded with proposing the PR, but there was something worrying me about it in the back of my head. Suddenly, the idea of a "simple hack" had turned into a "really complex hack".

The power of community

While I got some "great" and "fantastic improvement" comments as part of the review, there were also some other comments.

First, Ping Zhang, one of the Airflow committers, mentioned something that I also wanted to do: we should turn it into a "reliable feature" of Airflow, one that reliably passes the context in which the DAG is parsed. And yeah, I had actually proposed that from the beginning. Then Ping mobilized me to make another PR with the "proper" implementation (but more about that later).

But then there was a wake-up call. One of the committers and PMC members, Jed Cunningham, expressed his distress about it. First very gently, but when I iterated, asked for more feedback and tried to address what I thought was the problem, he came back with something like "I am not sure if I am able to express it, but …".

And after reading it — I couldn’t agree more.

First of all, this is a lot of code to copy & paste, and it will outlive the PR. If we make it part of the official documentation, it will stay there. It will become part of the product, and suddenly we will have to start maintaining it and somehow keep it compatible. Even if I added an "experimental" label, we all know that those "experimental" bits tend to remain in the code forever. I was already on the verge of "should we really do that?", and Jed's comments were the last drop.

After reading and re-reading it, I decided: "no, we cannot actually put this into our documentation". But I can write a blog post about it (which you are reading right now). We should not make it part of the product. Airflow is definitely a product, not a solution, and we should treat it with all the seriousness it needs. So I knew I would have to close the "documentation" PR and work more on the "product" approach. There you go.

Turning it into a product

Finally the “product” version of the PR looks like this:

Proper “product” implementation

Yep. This is the PR that implements the same feature in the "product" way. This is the change that provides the appropriate context to parsing. And it has it all:

  • it contains all the tests and harnesses that protect against various edge cases
  • it provides a future-compatible API that we will be able to take care of and maintain in the future
  • it is implemented in a way that even allows a "python __future__" kind of approach, so that it could be backported to an earlier version of Airflow

This is even more than 9x the original code. Yet it also provides our users a very simple, straightforward, Pythonic API that they can rely on when optimizing their DAGs.
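For illustration, using the `get_parsing_context()` API that eventually shipped in Airflow 2.4, the thousand-DAG file from the beginning of this post can be optimized roughly like this (the DAG ids and schedule are, again, made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.dag_parsing_context import get_parsing_context

# When a task is being executed, the parsing context carries its dag_id;
# when the file is parsed for scheduling, dag_id is None.
current_dag_id = get_parsing_context().dag_id

for i in range(1000):
    dag_id = f"generated_dag_{i}"
    if current_dag_id is not None and current_dag_id != dag_id:
        continue  # skip generation of non-relevant DAGs
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule="@daily",
    ) as dag:
        EmptyOperator(task_id="do_something")
    globals()[dag_id] = dag
```

No argv parsing, no process titles: the executor-specific detection is done inside Airflow itself, behind a stable API.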

And this is likely what we are going to move forward with. But it is only going to be available in the next version of Airflow, because even if we want to help our users, we cannot do it in a way that would be unsustainable. Finally, Felix commented on the PR something along the lines of "this is so much cleaner and better, indeed it was a bad idea to share it in the original form".

Conclusions

If you've gotten this far, one important takeaway from this post is: never underestimate the effort needed to turn a simple solution into a fully-featured product. Even if you originally have just a few lines of code, making it a maintainable, long-term usable product feature will often take far more time and effort. Even the 3x and 9x multipliers are underestimates. Also, if you are developing a product, stopping at the 3x "reusable" solution is not good enough, and you should resist the temptation to release it when it's a "half-product" or when it makes your product vulnerable to long-term maintenance issues.

But there is a deeper learning and wisdom hidden in the story.

As you might see from the whole story, even if you come up with the best idea in the world, see great improvements as the result, and initially see that it solves a lot of problems, it might not be best to get it out immediately. Airflow is an Apache Software Foundation project, and "Community over Code" is one of the most important tenets of the Apache Way; you can see the embodiment of it in this story.

Community over Code

What started with a user who had a problem and raised it in the community discussion channel, and then with an idea to solve it, turned into a small but useful feature in Airflow, avoiding a few traps along the way. And it happened only because we have a great community of people who collaborate, trust each other, and are able to (more or less gently) argue and see different perspectives; great things happen as the result of such cooperation.

Bouncing ideas off each other, not being afraid of expressing your doubts, and especially not being afraid to change your approach as the result of the feedback you get: this is the true power of Airflow. This is one of the reasons you can rely on the fact that what we come up with in Apache Airflow is not the result of a single power, person or organization's decision (as sometimes happens in other "open source" projects), but the result of true collaboration between all the different individuals who ARE the community.

This is what Apache Way all but guarantees.
