Apache Doris' Post
Always nice to see approachable, easy-to-follow guides for data engineering beginners. 👍 After introducing how to build your first data platform with MySQL, Apache Doris, and Apache Flink, Mohamed Amine Turki moves a little upstream and explains Change Data Capture (CDC) with step-by-step instructions. https://lnkd.in/g9igQNjY (P.S. Apache Doris provides a Flink-Doris-Connector with built-in CDC support: https://lnkd.in/gmPESd3V) #dataengineering #beginner #Flink #CDC
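If CDC is new to you, here is a minimal conceptual sketch of what a CDC consumer does: it replays insert/update/delete events from a source's change log onto a downstream copy. The event shape loosely mirrors a Debezium-style envelope ("op", "before", "after") but is a plain-Python simplification for illustration, not the actual Flink CDC or Doris connector API.

```python
replica = {}  # downstream table keyed by primary key

def apply_change_event(event: dict) -> None:
    """Apply one change-log event to the downstream replica."""
    op = event["op"]
    if op in ("c", "u"):              # create or update: upsert the new row image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":                   # delete: drop the old row image's key
        replica.pop(event["before"]["id"], None)

# Replaying a tiny change log keeps the replica in sync with the source.
log = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "before": {"id": 1, "name": "alice"}, "after": {"id": 1, "name": "bob"}},
    {"op": "d", "before": {"id": 1, "name": "bob"}, "after": None},
]
for event in log:
    apply_change_event(event)

print(replica)  # {} -- the insert, update, and delete all replayed in order
```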
More Relevant Posts
-
Apache Hudi is transforming distributed data management! With its scalable, real-time approach, it helps teams improve data quality and streamline analytics workflows. Follow Shiyan Xu's posts on Apache Hudi to gain deeper insights, and connect with me to learn how Apache Hudi can help you manage your data. #ApacheHudi #DataManagement #BigData 🚀
Apache Hudi: From Zero To One (1/10)
blog.datumagic.com
-
ScyllaDB, like #ApacheCassandra, HBase, RocksDB, and #CockroachDB, uses the log-structured merge-tree (LSM tree) data structure. Learn more about the LSM tree in this free #ScyllaDB University lesson: https://ow.ly/Ky7y50PBVcE #TechTips #NoSQL #TechTraining #database
[Free Lesson] Learn about the LSM Tree in ScyllaDB University
https://university.scylladb.com
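For a feel of the idea before the lesson, here is a toy LSM-tree sketch: writes go to an in-memory "memtable"; when it fills, it is flushed as an immutable sorted run (an "SSTable"). Reads check the memtable first, then runs from newest to oldest. Real engines add write-ahead logs, bloom filters, and compaction on top of this.

```python
MEMTABLE_LIMIT = 4

memtable: dict[str, str] = {}
sstables: list[list[tuple[str, str]]] = []       # newest run appended last

def put(key: str, value: str) -> None:
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:          # flush when the memtable fills
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key: str) -> str | None:
    if key in memtable:                          # freshest data lives in memory
        return memtable[key]
    for run in reversed(sstables):               # newest run wins on duplicates
        for k, v in run:                         # real runs use binary search here
            if k == key:
                return v
    return None
```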
-
🔗 New read on Medium! 📘 I've just published an article exploring the integration of Apache Spark with DuckDB to optimize data analytics operations. This setup proves to be a game-changer for handling vast datasets with incredible efficiency. 🚀
Highlights include:
- Step-by-step guide on combining Apache Spark with DuckDB
- Insights into the performance benefits of this integration
Ideal for data professionals looking to enhance their toolkit. Check it out and let me know your thoughts or experiences with these technologies!
👉 Read the full article here: Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance (https://lnkd.in/d3cuhWpx)
#DataEngineering #BigData #TechInnovation #ApacheSpark #DuckDB
Enhancing Data Analytics: Connecting Apache Spark and DuckDB for Optimal Performance
medium.com
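One common way to pair the two engines (not necessarily the article's exact setup): let Spark do the heavy distributed transformation and land the result as Parquet, then point DuckDB at those files for fast local analytics. The path and column names below are placeholders made up for illustration.

```python
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-duckdb-demo").getOrCreate()

# Heavy lifting in Spark: transform and write a columnar dataset.
df = spark.range(1_000_000).withColumnRenamed("id", "order_id")
df.write.mode("overwrite").parquet("/tmp/orders")

# Fast local analytics in DuckDB: it reads the Parquet files in place,
# no separate load step required.
result = duckdb.sql(
    "SELECT count(*) AS n, max(order_id) AS max_id FROM '/tmp/orders/*.parquet'"
).fetchall()
print(result)
```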
-
Let's make the lives of data engineers easier! At Knowi, we've got the strategy:
1️⃣ Forget about ETL/ELT. You don't need it with Knowi. There are no complex workflows or convoluted steps just to pull your data in.
2️⃣ Your data stays where it is, and you bring in only what's necessary. Keep everything efficient and minimize the headache of data migration.
3️⃣ Use the native query language of your database, whether that's Mongo, Elastic, SQL, or our in-house SQL-like syntax. We're all about flexibility.
4️⃣ Forget maintaining connectors for SQL, API, or NoSQL data. We handle that for you so your engineers can focus on what matters.
5️⃣ No more silos or disjointed data. We help you pull from and join all your data sources, both NoSQL and SQL.
At Knowi, we're making data engineering smoother, faster, and more efficient, giving engineers the tools to excel without the added complications. Discover a new way to handle your data today. Book a demo: https://lnkd.in/gtCgZYEA #EngineeringSimplified #ETL #DataQuery #BusinessIntelligence
-
Executive and Thought Leadership in "Gen AI", "Machine Learning", "Artificial Intelligence", "Data Science", "Cloud", "Data Analytics" "MLOps", "AIOps"
Building a Batch Data Pipeline with Athena and MySQL: An End-to-End Tutorial for Beginners. Continue reading on Towards Data Science » #Technology #DataAnalytics #DataDriven #MachineLearning #ArtificialIntelligence #DataScience
Building a Batch Data Pipeline with Athena and MySQL
towardsdatascience.com
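A minimal sketch of what the Athena half of such a batch pipeline can look like from Python with boto3: submit a query, poll until it finishes, and let Athena drop the results in S3. The region, database, query, and bucket names are placeholders, and this is not necessarily the tutorial's exact code.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the batch query; Athena writes results to the S3 output location.
qid = athena.start_query_execution(
    QueryString="SELECT order_date, sum(amount) FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "my_analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/batch/"},
)["QueryExecutionId"]

# Athena is asynchronous: poll the execution state until it settles.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(f"Query {qid} finished with state {state}")
```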
-
💡 Tuesday tech tip: #ScyllaDB can automatically delete expired data according to a Time to Live (TTL). The TTL can be set when defining a table, or per write with INSERT and UPDATE queries, as shown in this free ScyllaDB University lesson: http://ow.ly/Kmbb50MWTHH #TechTip #NoSQL #NoSQLdatabase
Expiring Data with TTL (Time to Live) - ScyllaDB University
university.scylladb.com
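A small sketch of both TTL styles the lesson describes, sent from Python via the cassandra-driver (which also speaks to ScyllaDB). The host, keyspace, and table are placeholders; the CQL itself is standard.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")  # placeholder host/keyspace

# Style 1: a table-level default -- every row expires after one day.
session.execute("""
    CREATE TABLE IF NOT EXISTS sessions (
        user_id text PRIMARY KEY,
        token   text
    ) WITH default_time_to_live = 86400
""")

# Style 2: a per-write TTL that overrides the default (10 minutes here).
session.execute(
    "INSERT INTO sessions (user_id, token) VALUES (%s, %s) USING TTL 600",
    ("alice", "abc123"),
)
```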
-
🔍 Delta Lake vs. Apache Hudi: The Next Step to a Data Lakehouse 🏔️ My comparison results: Read/Write Performance
Achieving a seamless and cost-effective data architecture is paramount in our data-driven world. That's why our data team is building a data lakehouse as the future of data management, so that we can:
🔸 Save cost
🔸 Scale
🔸 Avoid vendor lock-in
If you haven't heard about data lakehouse technologies and open table formats, here's a quick run-through:
🏠 What's a Data Lakehouse? A data lakehouse combines the strengths of data lakes with the functionality of traditional data warehouses. It offers the best of both worlds: scalability and cost-effectiveness, without compromising on key features like ACID transactions, schema enforcement, and BI support. 🚀🏢
📂 Open Table Formats Explained Open formats like Delta Lake, Apache Hudi, and Apache Iceberg are paving the way for organized and efficient data lakes. Delta Lake, for instance, manages data through log files and data files, maintaining historical changes in its changelog. 📊📁
🔍 The Experiment and Results We ran a comprehensive test on 23.4GB of Parquet data using AWS Glue with 3 G.1X workers (3 DPU). Our experiment covered bulk writes/inserts, upserts, and read operations across different stages.
📈 The Verdict When it comes to read/write performance, Delta Lake outshines Hudi. ✅🚀 Regardless of whether you opt for Hudi's MoR (Merge-on-Read) or CoW (Copy-on-Write) table type, Delta Lake consistently demonstrates superior performance, especially as data scales. 📊
🧩 Discovering Hudi's Cool Secrets Although Delta Lake takes the crown in this performance battle, Hudi has unique benefits beneath the surface. For one, it assigns a unique identifier to every record in your data, while Delta doesn't. Let me know if you would like to know more! Nicholas Leong #deltalake #dataengineering #datascience #dataanalytics #apachespark
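A rough sketch of the kind of head-to-head bulk-write test described above, in PySpark. It assumes a Spark session already configured with the delta-spark and Hudi Spark bundle jars; the paths and field names are placeholders, and this is not the author's exact Glue job.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-write-bench").getOrCreate()
df = spark.read.parquet("/data/source_parquet")   # stand-in for the 23.4GB input

# Bulk write as a Delta table and time it.
t0 = time.time()
df.write.format("delta").mode("overwrite").save("/lake/delta/events")
print(f"Delta bulk write: {time.time() - t0:.1f}s")

# Bulk write as a Hudi table; record key and precombine field are placeholders.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # or MERGE_ON_READ
}
t0 = time.time()
df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/lake/hudi/events")
print(f"Hudi bulk write:  {time.time() - t0:.1f}s")
```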
-
Microsoft Certified Azure Data Engineer (DP-203) | Python | SQL | Big Data | Azure Data Factory | Azure Databricks | Spark-SQL | ADLS | PySpark | ETL | Hadoop | Hive | PowerBI
✴ Z-ordering optimization in Delta Lake involves organizing data files within a table based on a Z-ordering, or Morton ordering, scheme. This technique arranges data in a linear sequence, optimizing data retrieval and processing for spatial or range-based queries. Here's how Z-ordering works in Delta Lake optimization:
1️⃣ Spatial Indexing: Delta Lake leverages Z-ordering to organize data files in a multidimensional space, such as geographical coordinates or timestamp ranges. By interleaving the bits of each dimension's value, Z-ordering creates a linear sequence that preserves locality, ensuring nearby data points are stored close to each other in the sequence.
2️⃣ Data Skew Reduction: Z-ordering helps reduce data skew by evenly distributing data across partitions or files based on their Z-ordering values. This minimizes hot spots and imbalance in data distribution, leading to more balanced query execution and improved parallelism.
3️⃣ Range-Based Query Optimization: Z-ordering optimizes range-based queries by clustering related data points together in the linear sequence. This reduces the amount of data that needs to be scanned during query execution, resulting in faster queries and less I/O overhead.
4️⃣ Predicate Pushdown: Z-ordering makes data skipping more effective: query predicates are checked against file-level statistics, and because Z-ordering clusters related values into the same files, the engine can skip irrelevant files entirely, further improving query performance.
5️⃣ Delta Lake Integration: Z-ordering is built into Delta Lake, and users can apply it to specific tables or partitions using configuration options or DDL commands. Delta Lake manages the Z-ordering metadata and file organization automatically, simplifying optimization and maintenance for data engineers and administrators.
Overall, Z-ordering optimization in Delta Lake provides an efficient and scalable approach to organizing and accessing multidimensional data, leading to improved query performance, reduced data skew, and better resource utilization in data lake environments. #sql #mysql #dataanalytics #database #datascience #dataanalysis #dataengineering #dataanalyst #bigdata #cloud #bigdataengineer #datawarehouse #datalake #pyspark #python
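In practice, applying Z-ordering with the delta-spark Python API (Delta Lake 2.0+) typically looks like the sketch below. The table path and column names are placeholders; the equivalent SQL form is OPTIMIZE events ZORDER BY (event_date, region).

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zorder-demo").getOrCreate()

table = DeltaTable.forPath(spark, "/lake/delta/events")  # placeholder path

# Rewrite the table's files so rows close in (event_date, region) land in
# the same files, letting range queries on those columns skip whole files.
table.optimize().executeZOrderBy("event_date", "region")
```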
Senior Data Infra Engineer - Helping young professionals access the Data World
Appreciate the support 😊