Unraveling the Code: Navigating a CI/Release Security Vulnerability in Apache Airflow

Jarek Potiuk
Published in Apache Airflow
8 min read · Dec 13, 2023

Introduction:

In the ecosystem of open-source development, where developers rely on ever more complex tools and processes to build and release their code, projects like Apache Airflow must strike a delicate balance between innovation and security.

Recently, a vigilant bug bounty hunter discovered a flaw in the project’s GitHub Actions workflow. This blog post takes you on a journey from the revelation of the vulnerability to the meticulous remediation steps taken to fortify the project’s defenses.

Software supply chain security

The Discovery:

The narrative unfolds with the discovery of a detailed bug bounty report, titled “Code Execution in Github Actions workflow allows secret exfiltration.” This report became the lodestar, pointing towards an anomaly within the execution of CI commands during the crucial release preparation phase. Specifically, commands used during the package preparation such as breeze prepare-provider-packages and breeze prepare-airflow-packages came under scrutiny.

CI and release build context:

To appreciate the gravity of the vulnerability, let’s look at Apache Airflow’s release preparation process. The package preparation tasks traditionally use the same breeze command that is also used in CI jobs. The reason for having dedicated commands is simple: reproducibility. The breeze development environment is a wrapper around common actions executed in CI, in the development environment and in the release process. It relies on Docker containers for several reasons: avoiding the “works for me” syndrome, eliminating the need for separately maintained virtual environments, simplifying complex multi-step operations and long commands that build and run containers for testing, and ensuring a controlled environment during tasks like executing setup.py for airflow packages.

The Vulnerability Unveiled:

The crux of the matter was a subtle yet impactful typo in the GitHub Actions workflow. This oversight allowed code from public pull requests to escape the confines of the container during the “Build image” workflow, which by its nature had to have write access to the GitHub Container Registry in order to share cached Docker images. The Docker cache speeds up the CI tests immensely and is essential for setting up the local development environment. While the typo looked innocuous, the real peril was Docker cache poisoning.

The Danger:

Docker cache poisoning was the exceptional danger stemming from the vulnerability. In essence, a malicious actor with write access to the project’s cache could manipulate an image. The manipulation would involve injecting seemingly legitimate commands while discreetly slipping in altered binary code, which would make detection challenging. The ultimate risk was a compromised release manager’s image, potentially introducing malicious code into the packages slated for release.

Assessing the Risk:

Being part of the Apache Software Foundation, the Apache Airflow project has a very sound process of release preparation and verification, with multiple independent PMC members (a minimum of three) verifying the provenance and integrity of the released artifacts. In this multi-stage process, PMC members verify the signatures, checksums and licenses of the released artifacts and check that the packages were generated from sources tagged with a signed tag in the Git repository. It was therefore likely that any attempt to tamper with the process would have been caught during verification. While the report showed that the vulnerability was real, the other safeguards still held. The situation was not as bad as it could have been had our processes not been sound, documented and meticulously followed, as the ASF processes mandate.
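To make the checksum part of that verification concrete, here is a minimal sketch of how a verifier might compare a downloaded artifact against its published SHA-512 checksum. It assumes the checksum file follows the common "<hex digest>  <filename>" layout; the function names are hypothetical, and this is an illustration rather than the project’s actual verification tooling.

```python
import hashlib
import hmac
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact's digest with the first token of its checksum file."""
    # Assumption: the checksum file looks like "<hex digest>  <filename>".
    tokens = checksum_file.read_text().split()
    if not tokens:
        return False  # an empty checksum file can never match
    return hmac.compare_digest(sha512_of(artifact), tokens[0].lower())
```

A checksum only proves that the bytes match what was published. Signature verification and the comparison against the signed Git tag remain separate steps.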

Assessing the risk landscape, it became evident that while the danger wasn’t immediate, the potential for exploitation loomed large. The hypothetical attack scenario involved a bug in our CI, an anonymous pull request that did not even have to be approved, and meticulous manipulation to avoid detection. It was a sophisticated dance that would be complex to perform, but it could be attempted in a targeted attack against the most popular workflow orchestrator, used by tens of thousands of users, large and small, all over the world.

Funding security improvements of Apache Airflow:

As it happened, the issue was reported while a team of individual contributors and PMC members had received funding from the Sovereign Tech Fund to improve the security and release processes of Apache Airflow as part of the Contribute Back Challenge (Round 1). The funding was announced on the ASF blog. It is an initiative that matters in light of upcoming security regulations and is one component of a long-term strategy on open source. Since regulations in this area are coming in multiple regions (for example, the CRA in Europe is at the last stage of negotiations), it is increasingly important to have various models of funding for “ground security-focused work” that might otherwise be treated as an afterthought, because people working on OSS projects mostly focus on developing “new features”.

This was a very nice coincidence, because the individuals who got the funding were focused on this very subject and also had the time reserved, plans in progress and money to support the investment needed to actually implement many process improvements that address the issue.

Remediations Implemented:

Now, let’s unravel the layers of strategic remediations that were meticulously implemented to fortify Apache Airflow against analogous threats.

  • Reducing Reliance on CI Image: A shift in approach was proposed: reducing the heavy reliance on the CI image. The idea was to use a “generic” Python image instead, achieving the same level of isolation without inheriting the risks of a compromised CI cache.
  • Process Improvements: Critical process enhancements took center stage. Release managers, entrusted with critical tasks, no longer rely on the CI image. They were given a new process that uses reproducible builds, official Python images only and a local environment rather than shared, remote binary CI images. This deliberate move minimized the risk of unauthorized modifications during the vital release and verification process.
  • Reviewing the build tooling: A review of the tooling we use in CI showed that the Airflow team could not depend entirely on Docker isolation. In 2020, the Platypus attack was revealed, which allowed an attacker to steal secrets from the machine they were running on. While it had been addressed in the general case, Docker containers turned out to be potentially vulnerable to stealing secrets from the host they were running on. Docker released a security advisory on it in October 2023. Once the Airflow team realized that this undermined some of our assumptions about this scenario, we immediately upgraded to the latest Docker versions, which address the vulnerability by disabling access to the powercap device.
  • Enhanced Verification Methods: A robust upgrade to the verification process was introduced. By incorporating reproducible builds and non-shared-container builds, the PMC members now have simpler, more robust ways to verify the provenance of generated code. Local verification, combined with comparisons against GitHub tags and rebuilding the packages reproducibly with byte-for-byte identical results, added another layer of security.
  • Retrospective Inspection of Past Releases: Given how long the vulnerability may have existed, a meticulous retrospective examination of past releases was recommended. This involved comparing the code in historical releases to the current state to identify and rectify potential tampering: an exhaustive yet imperative audit for added security.
  • Future work: While Airflow has already hardened and improved its release process, the work is not complete yet. Airflow provider packages already have reproducible builds, but the core airflow package is not there yet: more changes are needed, including the adoption of modern Python tooling. The team is on a good track to get there in this, and hopefully the next, round of the “Contribute Back Challenge”.
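Byte-for-byte reproducibility, mentioned in the remediations above, depends on removing everything from a package that varies between builds: timestamps, file ordering, ownership. The sketch below illustrates the general idea with Python’s standard library only. The function names and the fixed timestamp are hypothetical, and this is not how Airflow’s packages are actually built.

```python
import gzip
import hashlib
import io
import tarfile

# Hypothetical fixed timestamp; real builds often derive it from SOURCE_DATE_EPOCH.
FIXED_MTIME = 0


def build_archive(files: dict[str, bytes]) -> bytes:
    """Pack files into a .tar.gz whose bytes depend only on names and contents."""
    raw = io.BytesIO()
    # mtime=FIXED_MTIME keeps the current time out of the gzip header;
    # filename="" avoids embedding a file name there.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=FIXED_MTIME) as gz:
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.PAX_FORMAT) as tar:
            for name in sorted(files):  # stable member order
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                info.mtime = FIXED_MTIME
                info.mode = 0o644
                info.uid = info.gid = 0
                info.uname = info.gname = ""
                tar.addfile(info, io.BytesIO(files[name]))
    return raw.getvalue()


def sha256_hex(data: bytes) -> str:
    """Digest that two independent builders can compare."""
    return hashlib.sha256(data).hexdigest()
```

Two people who run build_archive on the same inputs with the same Python and zlib versions should get the same digest from sha256_hex, so a verifier can rebuild locally and compare digests instead of trusting a shared binary.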

Conclusion:

In conclusion, the journey from vulnerability discovery to remediation stands as a testament to the Apache Airflow community’s unwavering commitment to security and code integrity.

This blog post serves as a lesson in vigilance, collaboration, and the continuous pursuit of excellence in code craftsmanship. As Apache Airflow navigates the intricate landscape of open-source development, we hope it also encourages other projects to stay vigilant, foster collaboration, and improve their code’s resilience. The supply chains and the build and release processes of many open-source projects potentially have flaws and weaknesses that malicious attackers could exploit.

It’s crucial to keep tabs on your project’s build and release process and tooling. Having funding for individuals who are experts in the projects they voluntarily contribute to also helps make it happen. Even in well-established and mature projects there are often things that can be improved and hardened and extra layers of protection that can be added. Continuous vigilance, quick reaction to raised issues and time for deeper analysis are necessary to keep up with security challenges.

Credits

The credit for finding the original issue goes to Harish (@d3ku100 on HackerOne), not only for reporting it but also for being persistent in explaining the issue and helping us verify the fixes once we applied them.

Appendix:

Due to limited space, this article does not dive deeply into the details of the vulnerability and remediations. Since Airflow is an open-source project, those interested in a deeper dive are free to look at some of the pull requests that implement the improvements mentioned in this article. If there is enough interest, I might write a more detailed post on the problems discovered.

Fixing the original typo that caused the issue:

Switching to Python Official images for builds

Modernizing package building and reproducible builds

Upgrading Docker to avoid Platypus attack
