Shared volumes in Airflow — the good, the bad and the ugly

Jarek Potiuk · Published in Apache Airflow · Jul 25, 2022 · 15 min read

This is my highly personal take on using shared volumes for Airflow to share DAG files (and Plugins — but I will use DAG files to shorten it) between Airflow components.

I know this might be a controversial subject. I shared my view with a number of people in Airflow Slack and GitHub Issues/Discussions, and I know that what I write here might meet with some disagreement.

But hey, a Medium blog post is a nice way to express your thoughts in a (hopefully) clearer way than an ad-hoc discussion, and the blog is mine, so why not share my opinion here.

I think it is a good opportunity to describe why I think shared volumes are often not the best choice for sharing DAGs and Plugins in your installation. I hope that after reading it, you will understand when they might make sense, but also when moving from shared volumes to Git Sync might be a good idea.

My view on the subject is that while shared volumes are easy and good to start with, eventually, as your Airflow installation grows and matures, moving to direct Git Sync is the best approach.

Possible evolution of the Airflow shared volume approach

Context

Shared volumes are one of the ways you can share DAG files (and Plugins) among your Airflow components. There are a few ways you can use, as we described in the documentation of our Helm Chart. There are also a few other ways not mentioned in the Helm Chart documentation (because the Helm chart is cloud-agnostic). Managed deployments of Apache Airflow offer object-storage (S3/GCS) synced solutions, for example. But those cloud-based solutions are equivalent to using shared volumes (in our Chart implemented by PVCs, Persistent Volume Claims).

The possible options for DAG sharing essentially boil down to:

  • pre-baking the DAGs in the Airflow Image
  • sharing the DAGs via shared volumes (that’s the PVC/GCS/S3 approach)
  • using Git-Sync to synchronize your DAG files

Pre-baking the images has some obvious drawbacks, mainly that you need to redeploy the image to change or add DAG files. But I personally think that the general benefits and usability of "shared volumes" are quite overestimated by many users. And often, unknowingly, they stick to them even when their installation grows and requires better engineering practices, where "shared volumes" are rather an obstacle than a help.

The good

Let’s start with the good things. Why shared volumes might be a good choice?

Simplicity for your users is the main thing that comes to my mind. What is easier for our users than a "folder" they can "drop their files in"? There is nothing simpler. Just dedicate a volume on a shared network where they can drag & drop their files, or run a cp command to copy them, and in a few seconds (or minutes, but we will get to that shortly) the files will magically appear in the DAG folder of the scheduler and workers and get executed. And if your users are mostly data scientists, who are used to iterating on their files locally, experimenting, and quickly deploying stuff by copy-pasting, they do not need to learn any new tools, nor follow any rigorous deployment workflow. Hey, we just copy the file here and... it works.

Sounds cool? Yeah. Because it is cool.

This is how shared volumes work

When you have a small-ish installation and a handful of DAGs that are mostly accessed by one user, this is a perfect solution. And yeah, in such a case I’d heartily recommend deploying Airflow with shared volumes.

The bad

But there is a nasty side to this that you do not see at first; when your orchestration needs grow and your team grows, it starts to show its nasty "Hydra-like" heads. As a software engineer without much hair left on my head, I still recall the times when we did the same with our software.

It was not so long ago (you will likely not believe it, but it was in the 21st century) when I started to work at a small startup (I will not mention the name here), and to my utter disbelief I found out that we were editing the code directly on a shared volume on a large "company server", without any version control. And the startup owner (a software engineer) claimed that this was "enough". He was perfectly happy with "we have regular backups" and "our server has a disk array to keep it safe". Yep, it was the 21st century. If you write software in any company nowadays, you would be quite surprised to see this happen. I was, even back then. (Side story: the next day I introduced Mercurial, modern at the time, to keep a bit of sanity in my new position.) It was just scary not to have modern control over your software development practice.

Hmmm, does it remind you of something?

Yeah, DAGs ARE code. If you keep them in a shared folder, how do you keep track of what happens with the DAG code? If you have a team of people working on the same set of DAGs and sharing some code, how would they resolve conflicts? Would they override each other's code? If you manage such a team, wouldn't you start tearing your hair out in despair over what might happen if they DO start overriding each other's code? (I would certainly start, if I had any hair left.)

Enter Data Engineering Best Practices.

Over the last few years we have seen tremendous growth of maturity in this area, following what happened in software engineering a few decades earlier. Airflow is one of the best examples; there is a reason why it is the most popular, truly open-source orchestrator in the world of data engineering: it promotes good data engineering practices, and it is actually one of the most popular data engineering platforms out there in general.

And you can even see how Data Engineering Best Practices became one of the most important subjects for Airflow users. If you look at the talks of Airflow Summit 2020, Airflow Summit 2021 and finally Airflow Summit 2022, you will see how "Data Engineering Best Practices" are maturing: first as "something we want to do", then "something we attempt to do", and in the final year "something we already do, and BTW, if you don't, you are behind". I know this for a fact because, as one of the organizers, I watched ALL the talks of ALL the summits (so you don't have to), and every year I am amazed at how our users mature in terms of engineering practices.

Shared volumes do not help with those practices. In some cases (versioned object storages) you can at most keep track of the history of uploaded files. But there is no way to see what changed, who changed it, and when. You have no idea which version was used at any given time. You do not know if someone introduced a "fat finger" typo in any of your DAGs.

What you then start to do is introduce those practices (and this is generally a very good idea). You start to keep your DAG files in version control, usually Git; you start to track the history and see who changed what; possibly (and that's highly recommended) you introduce code reviews; and maybe even (this is fantastic if you do) tests in CI that fail when they detect DAG problems, something like the test sketched below.
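A minimal sketch of such a CI check, assuming pytest and that your DAG files live in a "dags/" folder of the repository (adjust the path and details to your own layout):

```python
# A minimal sketch (not the only way to do it): a pytest check that fails the
# CI build when any DAG file cannot be imported. Assumes your DAG files live
# in a "dags/" folder of the repository - adjust the path to your layout.
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # import_errors maps a file path to the traceback of its failed import
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```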

All this is great and if you are growing out of the “small orchestration needs” — this is very important for your business to apply those practices.

But then there is the next step: you probably still move the DAGs to the shared volumes. And I saw many ways of doing it: someone manually syncing the files, automated scripts that packaged the files and unpacked them onto the shared volume, and recently I even heard of an automated process that regularly sent the files over an SSH connection to an AWS instance to put them on an EFS shared volume. This all seems complex and brittle.

Many organizations — when introducing the good engineering practices — only do it in the “DAG authoring” part.

Good Engineering Practices AND shared volumes

But what if the organisations also introduced them on the DAG distribution side?

The initial thought when you introduce good Data Engineering practices is:

  • "How can we put the files from Git into the shared volumes of Airflow?"

But I rather think that in many of those cases the question should be:

  • "How can we take the files from Git and send them to Airflow?"

Notice the lack of a “shared volumes” there? Yeah, that’s intentional. Shared volumes are not necessary in this case, and they actually get in the way.

Enter Git-Sync

Git Sync actually fits in here very nicely.

  • It removes the middle-man shared volumes completely.
  • It allows all Airflow components to independently synchronize their code with the Git repository, in a way that is very efficient (Git was created to store and distribute changes in code) and very flexible.
  • It allows you to plug into your DAG development workflow. You can designate branches to be "releasable", you can tag the releases and keep track of what has been deployed and when.
  • You can combine DAG code coming from multiple independent repositories into a single one via submodules.

Git-sync is a perfect fit for all the modern Data Engineering Practices, making your DAG code directly deliverable to your Airflow.

This is how Git sync works

Do not just take my word for it. I am but a humble committer of Airflow, and you might be surprised to learn that I do not really run Airflow in production myself. But maybe you will believe other users. Here is the fantastic "Manage DAGs at scale" presentation from Airflow Summit 2022, where Anum Sheraz from Jagex described how they manage 190 (!) Git DAG repositories (and are extremely happy with this setup).

So — if you have not thought about removing the shared file system from the picture — you can, because you do not need it any more when you start improving your engineering practices.

But, this should not be the reason for you to switch. There is the famous saying “The fact that you CAN do something does not automatically mean that you SHOULD”. Let me argue why you SHOULD.

The ugly

There is one really ugly part of using shared volumes that you don't realize until the number of your DAG files grows, your team grows, and the DAGs start to change a lot.

The problem is that stability, when you grow, can usually only be bought with (much) more money than you had anticipated (and even that has its limits).

If you use the cloud (who doesn't nowadays) and you are bought into your cloud platform, you would not really deploy your own file system. You would deploy something like (for example) Amazon EFS. Let's stick with this example, but this chapter applies pretty universally to many other shared volumes like it. It turns out that the more DAGs you have and the more they change, the more quickly you will find out that the very basic, "almost free" offering of EFS is not nearly enough.

Many of our users who had stability problems with their Airflow tracked them down to the stability and performance of the underlying filesystem. After some periods of instability they bought many more IOPS and, poof! magically, their Airflow stability became rock solid.

Why is this so?

Partly because the customers believe in the magic of shared volumes, and partly because Airflow uses the DAGs folder in a way that reveals that, in fact, shared volumes are not magical at all.

You need to understand what happens under the hood when you use a shared volume. Shared volumes (EFS included) provide you with the ILLUSION (hence the magic) of something that works like a local filesystem. In most cases that illusion seems to hold: you list the contents of the folder, you can read and write your files, and everything seems to work as if your files were right there on your local disk.

I am afraid I have to act a little nasty and break the illusion. The illusionist is just pretending this is happening. In fact, there is a lot more going on, and there is absolutely no escaping the fact that the files are actually pulled over the network from some kind of storage which is, bear with me, somewhere on the network on a different machine (or, usually, distributed among multiple machines). The details differ between filesystems, but there is no magic (or rather, "Any sufficiently advanced technology is indistinguishable from magic", to quote Arthur C. Clarke).

EFS under the hood uses NFS (Network File System). While the "Elastic File System" name is cool and it is sold as a "serverless" solution, in fact it does have servers in the Amazon network; they are just managed by the AWS team. If you are interested, here is a nice description of how NFSv4.1 works; EFS simply uses NFS (which is standardised by the IETF as RFC 3530).

The Airflow scheduler works (and this is by design) by continuously scanning the DAGs folder and reading all the files there. Continuously. Non-stop. All files.

Let that sink in for a while.

This basically means that your EFS is bombarded, all the time, with requests to scan and read DAG files. All the time.
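Just to get a feel for the scale, here is a quick back-of-the-envelope calculation (the numbers are purely hypothetical and the parsing interval is configurable in your Airflow setup):

```python
# Back-of-the-envelope only: hypothetical numbers, adjust to your installation.
dag_files = 500          # hypothetical number of files in the DAGs folder
parse_interval_s = 30    # default [scheduler] min_file_process_interval in Airflow 2

reads_per_hour = dag_files * 3600 // parse_interval_s
print(f"{reads_per_hour:,} file reads per hour")  # 60,000 reads, each one a round-trip to the NFS servers
```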

If you look at the NFS protocol implementation, all the communication with the servers happens via Remote Procedure Calls (RPC), and they are serialized. This basically means that the more small requests you make, the more serialized the communication becomes. NFSv4.1 has good support for bundling those together, but when you have continuous scan/read commands for multiple files, this can only help a little.

This is how NFS (and EFS) looks under the hood

NFSv4.1 has a clever trick: the server can grant a delegation to the client for particular files, based on access patterns, which makes it possible for the client to cache those files and act on them as if they were accessed locally. The problem with this feature is that you cannot control it from the client side (it is entirely server-driven), and there are many factors that can break it (for example, delegation does not work at all if you are behind a NAT gateway, because it requires server callbacks). This is not a well known fact, and you have neither control over, nor generally knowledge of, whether it is being used or not.

Even assuming your local EFS client has some cache to store the files, if that local cache is not enough when you have more files in your folder, the cache will be continuously evicted and the files, even those that were "delegated" to you, will be re-downloaded again. From the user perspective it is a bit of magic: you open a file, read it, and it looks as if the file was locally available. But with Airflow's pattern of scanning and rescanning the folder and re-reading all the files continuously, all the files might have to be continuously downloaded over the network.

Even if you look at the AWS EFS performance tips, they mention that local file caching might be enabled but has no impact on latency, which means that the EFS servers are contacted on every single access to every single file:

The distributed nature of Amazon EFS enables high levels of availability, durability, and scalability. This distributed architecture results in a small latency overhead for each file operation. Because of this per-operation latency, overall throughput generally increases as the average I/O size increases, because the overhead is amortized over a larger amount of data.

There is also an interesting observation you can make (this is a little side comment). Do you know what you are paying for when you use EFS (and, generally, other similar distributed volumes)? Yes, it is mentioned above: reliability, durability, scalability, distributed architecture. This sounds really cool. But... think for a while. If you follow good engineering practices and keep your files on a "solid" Git server, do you ACTUALLY need any of those? Your Git server is already reliable, durable and scalable. It is likely already highly distributed and quickly accessible from wherever (and if not, then with the pandemic and your employees distributed all over the world, you should generally make it so). Do you need any of those properties in a filesystem that JUST keeps snapshot copies of something you already version, store and back up? Do you really think you should pay for all those features you do not actually need?

Coming back to the main topic: the DAG files are, by their nature, rather small. Or they should be if they are not. DAGs are code, and Good Engineering Practices (when you apply them) are very explicit about it: keep your Python modules small.

This basically means that your EFS needs all the IOPS it can get to deliver the scalability, reliability and distribution (none of which you actually need) in order to sustain the constant pressure. The more your system grows, the more you will experience increased latency. The more good engineering practices for your DAG code you implement, the worse it gets. And when it does not have enough IOPS, nasty things happen.

Enter atomic updates

When EFS does not get enough IOPS to sync files, what you can observe is that some files are refreshed with some delay. And the problem is that filesystems like EFS do not provide "whole DAG folder" consistency. If you have delays in networking (not enough IOPS) and a lot of changes in your files to distribute, it is pretty normal that some files have a newer version while others have older versions.

Imagine your DAG:

Example DAG using shared function

The “my_company.common_code” is a module that is shared between multiple DAGs.
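The DAG itself is shown as an image in the original post; here is a hypothetical reconstruction of what such a DAG might look like (only my_company.common_code and shared_function come from the example, everything else is assumed for illustration):

```python
# Hypothetical reconstruction of the example DAG: only my_company.common_code
# and shared_function come from the article, the rest is assumed.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from my_company.common_code import shared_function  # code shared between many DAGs


with DAG(
    dag_id="example_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    PythonOperator(task_id="run_shared", python_callable=shared_function)
```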

Imagine one of the DAG authors renames the function from "shared_function" to "shared_util". The change is done in both the DAG file and the "common_code" file, and the files get copied to the EFS.

Surprisingly, what might happen next is that your Airflow component ends up in a situation where the first file is still old and the second file is already new.

What happens then? It depends. If you are in the scheduler, you get an import error. If you are in a worker, you get a task failure. And both cases are pretty mysterious, because locally you can see that everything is correct.

This is because shared volumes do not guarantee atomic changes across more than one file. When it happens, you might end up in a situation where your Airflow is not stable and starts to have more and more random failures; the more DAGs and changes you have, the worse it gets.

Now, what happens when your company is growing and you start experiencing this? Yep, you start to buy more and more IOPS, even though this was something you neither needed nor anticipated initially (you anticipated that you would need more storage, but you never expected to need more IOPS).

This is almost inevitable when you grow. And I have heard this story a number of times from our users.

Git sync to the rescue

Yep. You guessed it. Git-sync is free of both of those problems.

  • You only need local volumes for the Scheduler when you use Git-Sync. Those local volumes can be scanned as fast as you want, as often as you want, and you are all but guaranteed that the only surprise waiting for you as you grow is needing more storage capacity.
  • Git-Sync has built-in atomic updates. You are guaranteed that what you see in your filesystem DAG folder at a given time will stay like that. If your DAG File Processor starts parsing a DAG, it will parse a fully consistent version of the DAG and all other files, linked together by the Git commit they came from (a sketch of how that atomic switch works follows this list).
  • Your workers benefit from the same atomic consistency and local volume access: once git-sync has done its job, there is NO MORE network communication involved when your worker picks up another task and parses the DAG file to run your code. This all happens locally within the worker!
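For the curious, git-sync achieves that atomicity by checking the new revision out into a fresh directory and then atomically re-pointing a symlink that your DAGs folder resolves to. A minimal sketch of the idea (this is not git-sync's actual code, and the paths are made up):

```python
# Minimal sketch of the "atomic swap" idea behind git-sync (not its real code).
# A new revision is checked out into its own directory, then the symlink that
# Airflow uses as the DAGs folder is re-pointed in a single atomic rename().
import os

def publish_revision(new_checkout_dir: str, dags_link: str = "/opt/airflow/dags/repo") -> None:
    tmp_link = dags_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_checkout_dir, tmp_link)  # stage a link to the new checkout
    os.rename(tmp_link, dags_link)          # atomic on POSIX: readers see old or new, never a mix
```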

There are people who question the performance of Git-Sync vs. shared volumes. But I think they discount the fact that the Git protocol was designed from the ground up to track and share changes to source files and is highly, highly optimized for that purpose, and that any modern way of hosting your code (GitHub/GitLab) is built with scalability in mind. They also discount the fact that Git-Sync only needs to sync the changes to DAGs, and ONLY when they change. The trade-off Git-Sync makes is pulling "some changes" rarely vs. "continuous EFS scanning and downloading". Those two are a few orders of magnitude apart.

Conclusion

OK, that was a bit of a long one, but let me summarize it:

  • When you start small and you care about the convenience of your DAG authors uploading their changes, shared volumes might be a good option.
  • But when you grow and want to apply good engineering practices, instead of leaving the shared volumes as a middle-man between Git and Airflow, better use Git-Sync. You will end up with a simpler, more stable and, most of all, cheaper solution that will not surprise you with sudden cost increases as you grow.

Let me just summarise it with this image.
