Data Engineering Track - ApacheCon

Jarek Potiuk
May 10, 2022


Image by Gerd Altmann from Pixabay

This year, Ismaël Mejía and I are co-chairing the Data Engineering Track at ApacheCon NA 2022, taking place in New Orleans, Louisiana, in October 2022.

Since this is the first time the track appears on the ApacheCon schedule, we wanted to explain our vision for it.

Why a Data Engineering track at ApacheCon?

In the last decade, many distributed databases and open-source Apache projects emerged for processing data at scale. They have quickly become the standard tools we use in the industry. However, processing data is not the only task required to build a reliable and consistent data platform.

The Data Engineering track is about the ‘other’ open-source tools and libraries we use to clean data, orchestrate workloads, and handle observability, visualization, data lineage, and the many other tasks that are part of data engineering. It is about the often-unheard-of open-source tools that are part of (or integrate with) the Apache data ecosystem and the role they play in the modern data stack.

The Call For Papers closes May 23rd 2022, so hurry up if you want to make it!

Why do we think a focus on Data Engineering is needed?

In the world of big and small data, data is king. Crunching and processing data quickly is at the heart of every business. There are plenty of open-source tools that focus on data processing, and they do their job marvelously. Each of these tools is a stepping stone enabling Data Scientists to make good use of the data.

However, to crunch data in all kinds of organizations in a consistent and repeatable way, you need some way to keep your data processing in order. Cleaning the data, visualization, orchestrating workloads, observability, data lineage and discovery: these are not easy tasks. They might look trivial or non-essential on the surface, but as your business scales they become indispensable, and you need tools and platforms that are engineering-focused rather than data-focused to scale your business without hiccups.

The Data Engineering track is all about the indispensable tools you need to get your data under control. You rarely hear about the tools and platforms used to keep your data in check from data scientists and analysts. The goal of those tools is to be invisible and do the job; if your data engineering tools do their job well, you rarely talk about them. So let’s talk about the Data Engineering tools and explain the role they play in a modern data stack.

What projects fall under the Data Engineering umbrella?

We think there are many projects, both Apache and non-Apache, that deserve the attention of Data Engineers. We have prepared a selection of projects we found relevant, but feel free to bring more to our attention. Let us know in private messages or comments if you think a project deserves to be added to the list.

Apache projects:

Non-Apache Open-Source projects:

  • Amundsen and other Data Governance and Discovery tools
  • Metadata: OpenMetadata, and others
  • Marquez and OpenLineage, Data Observability tools
  • Great Expectations, re_data, and other Data Quality tools
  • JupyterHub/Python (Notebook management)
  • Multiple Data Visualization Tools
  • Prefect, Alluxio and other orchestration tools
  • Your own, not-yet-known tool that integrates with the ecosystem

There are also multiple non-open-source projects in this space.

If you want to submit a talk and share your experience, here is a reminder:

The Call For Papers closes May 23rd 2022, so hurry up if you want to make it!
