Data Engineering @ Community over Code conference

Jarek Potiuk
Apr 3, 2023


As a follow-up to last year, Ismaël Mejía and I are co-chairing the Data Engineering Track at Community over Code NA (formerly ApacheCon) in Halifax, Nova Scotia, in October 2023.

You can see the videos from last year:

https://s.apache.org/data-engineering-videos-2022


Following last year's Data Engineering track, we want to build on what we learned and share our vision for the Track.

Why a Data Engineering track, and why @ Community over Code?

In the last decade, many distributed databases and open-source projects have emerged for processing data at scale. They have quickly become the standard tools we use in the industry and the backbone of modern data processing. However, processing data is not the only task involved in building a reliable and consistent data platform.

The Data Engineering track is about the open-source tools and libraries we use to clean data, orchestrate workloads, and handle observability, visualization, data lineage, and the many other tasks that are part of data engineering. It is about the often-overlooked open-source tools that are part of (or integrate with) the open-source data ecosystem and the role they play in the modern data stack.

Call For Presentations closes 00:01 UTC on July 13th, 2023, so there is quite some time yet, but we encourage you to submit your talk now, rather than wait for the last moment!

Why do we think a focus on Data Engineering is needed?

In the world of big and small data, data is king. Crunching and processing data quickly is at the heart of every business. There are plenty of open-source tools that focus on data processing, and they do their job marvelously. Each of these tools is a stepping stone that enables Data Scientists to make good use of the data.

However, to crunch data in a consistent and repeatable way across all kinds of organizations, you need ways to keep your data processing in order. Cleaning data, visualization, orchestrating workloads, observability, data lineage, and discovery are not easy tasks. On the surface they might look trivial or non-essential, but as your business scales they all become indispensable, and you need tools and platforms that are engineering-focused rather than data-focused to scale your business without hiccups.

The Data Engineering track is all about the indispensable tools you need to get your data under control. You don't often hear data scientists and analysts talk about the tools and platforms used to keep data in check. The goal of those tools is to be invisible and do the job. If your data engineering tools do a good job, you rarely talk about them. So let's talk about the Data Engineering tools and explain the role they play in a modern data stack.

What projects fall under the Data Engineering umbrella?

We think there are many projects that deserve the attention of Data Engineers. We have prepared a selection of projects that we find relevant, but feel free to bring more to our attention. Let us know in private messages or comments if you think a project deserves to be added to the list. Naturally, we come from the Apache Software Foundation, so the ASF projects come to mind first.

The ASF projects:

Non-ASF Open-Source projects

  • Amundsen and other Data Governance and Discovery tools
  • Metadata: OpenMetadata, and others
  • Marquez and OpenLineage, Data Observability tools
  • Great Expectations, re_data, and other Data Quality tools
  • JupyterHub/Python (Notebook management)
  • Multiple Data Visualization Tools
  • Prefect, Alluxio and other orchestration tools
  • Your own — not yet known — tool that integrates with the ecosystem

There are also multiple non-open-source projects in this space.

If you want to submit a talk and share your experience, a reminder:

Call For Presentations closes 00:01 UTC on July 13th, 2023, so there is quite some time yet, but we encourage you to submit your talk now, rather than wait for the last moment!
