Apache Hudi

Data Infrastructure and Analytics

San Francisco, CA 8,037 followers

Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics

About us

Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. Apache Hudi provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open source table format is not enough: Apache Hudi is also a comprehensive platform of open services and tools that are necessary to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from all around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to the mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/

Website
https://hudi.apache.org/
Industry
Data Infrastructure and Analytics
Company size
201-500 employees
Headquarters
San Francisco, CA
Type
Nonprofit
Founded
2016
Specialties
ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing

Updates

  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Apache Hudi’s Open Lakehouse Platform.

    A lakehouse architecture gives users the best of both data warehouses and data lakes, addressing some of the pressing issues of each (transactional guarantees, support for unstructured data, etc.). One of the core ingredients in a lakehouse platform is the "open table format" that tracks metadata for the actual #Parquet data files. Hudi offers an open table format, but it is important to note that Hudi is much more than a *generic table format*. Hudi brings core warehouse & database functionality directly to a #datalake, acting as a transactional layer over open file formats like Parquet/ORC and providing critical capabilities such as updates/deletes.

    ✅ Beyond the table format, Hudi also includes essential table services that are tightly integrated with the database kernel.
    ✅ These services can be executed automatically across both ingested and derived data to manage aspects such as table bookkeeping, metadata, and storage layout.
    ✅ On top of the table format & table management services, Hudi also offers various platform services to handle things like data ingestion, catalog syncing, data quality checks, and import/export tools.

    All these components extend Hudi's role from being just a 'table format' to a comprehensive & robust lakehouse platform.

    📗 Read more about the Hudi Stack in comments. #dataengineering #softwareengineering

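To make the "transactional layer over Parquet" point concrete, here is a minimal PySpark sketch (not from the post) of writing a Hudi table and then upserting a change into it. The table name, path, and field names are hypothetical, and it assumes Spark was launched with the Hudi Spark bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi's Spark integration expects Kryo serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "file:///tmp/hudi_trips"  # hypothetical table location
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # key used to merge records
    "hoodie.datasource.write.precombine.field": "ts",          # latest value wins on conflict
    "hoodie.datasource.write.partitionpath.field": "driver",   # illustrative partition column
    "hoodie.datasource.write.operation": "upsert",
}

# First commit: two records.
trips = spark.createDataFrame(
    [("t1", "driver_a", 10.0, 1), ("t2", "driver_b", 25.5, 1)],
    ["trip_id", "driver", "fare", "ts"],
)
trips.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)

# Second commit: an update for t1 arrives. Hudi merges it by record key, so the
# caller never has to rewrite or deduplicate the whole dataset by hand.
updates = spark.createDataFrame([("t1", "driver_a", 12.0, 2)], ["trip_id", "driver", "fare", "ts"])
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

spark.read.format("hudi").load(base_path).show()  # t1 now reflects fare=12.0
```

The second write only rewrites the affected file group rather than the full dataset, which is the database-style update/delete behavior the post describes.
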
  • View organization page for Apache Hudi

    🚀 Don’t Miss the Next Hudi Community Sync!

    Mark your calendars for August 20th, 2024, as we dive into another exciting community sync session! We’re thrilled to have Ankit Shrivastava, Senior Software Engineer at Uber, share insights on building low-latency solutions for complex data workflows using Apache Hudi.

    📅 Date/Time: August 20th, 9 AM PT | 12 PM ET
    🔗 Link: https://lnkd.in/eeujPjmn

    #dataengineering #softwareengineering

  • Apache Hudi reposted this

    View profile for Shashank Mishra 🇮🇳

    Data Engineer @ Prophecy🕵️♂️ Building GrowDataSkills 🎥 YouTuber (173k+ Subs)📚Taught Data Engineering to more than 10K+ Students 🎤 Public Speaker 👨💻 Ex-Expedia, Amazon, McKinsey, PayTm

    Uber has pioneered a cutting-edge data lakehouse architecture, leveraging the power of Apache Hudi to streamline and enhance their data pipelines. Here’s a breakdown of this innovative approach 👇🏻

    1️⃣ Data Lakehouse Overview: Uber’s data lakehouse integrates the best of data lakes and data warehouses, offering both scalability and structured query capabilities. This architecture ensures data is fresh and readily accessible for analytics and machine learning tasks.

    2️⃣ Apache Hudi: It plays a pivotal role by enabling efficient data management and processing. It supports two primary table types 👇🏻
    ✅ Copy-on-Write (CoW): Ideal for batch processing, rewriting entire files for updates, and ensuring high query performance.
    ✅ Merge-on-Read (MoR): Combines real-time data updates with asynchronous compaction, reducing I/O overhead and maintaining data freshness.

    3️⃣ Incremental Processing: With Hudi’s incremental processing, Uber achieves 👇🏻
    ✅ Faster data updates without rewriting entire datasets.
    ✅ Reduced operational overhead by efficiently handling small, frequent updates.
    ✅ Enhanced data quality and latency for critical applications.

    4️⃣ Compaction Strategies: Hudi’s compaction optimizes data by converting updates from log files to columnar formats, balancing between real-time data availability and query performance.

    5️⃣ Seamless Integration: Hudi seamlessly integrates with Apache Spark and other big data tools, supporting Uber’s large-scale data needs with minimal latency and high efficiency.

    6️⃣ Benefits Realized: With this innovative blend of data lakehouse with Hudi, Uber has:
    ✅ Reduced ETL pipeline runtimes by 50%.
    ✅ Decreased SLA times by 60%.
    ✅ Ensured 100% data completeness across their systems.

    🔗 Read this full blog - https://lnkd.in/du-YNepQ

    🚨 We are conducting a Masterclass for an Industry-Level Tableau Project 👇🏻
    🎓 Register Here - https://lnkd.in/gdbD-DRh
    📅 Date: 11-Aug-2024
    ⏰ Time: 7 PM IST

    🚨 Join "Tableau Project For Data Analyst BootCAMP" TODAY with the early bird offer; use code "EARLY200" to get an exclusive discount, VALID FOR 2 DAYS ONLY!
    📚 Course Curriculum - https://lnkd.in/gBWyhh9a
    📝 Enroll Here - https://lnkd.in/gvVPvECb
    📅 Live Classes Starting from 24-Aug-2024
    📲 Call/WhatsApp for any Query (+91)-9893181542

    Cheers - Grow Data Skills 😎

    #dataengineering #systemdesign #datalake #datapipeline

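As a rough illustration of the CoW/MoR choice and incremental processing described above, here is a hedged PySpark sketch (mine, not Uber's pipeline code). The paths, table and field names are placeholders, and it assumes the Hudi Spark bundle is on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-mor-incremental-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "file:///tmp/hudi_orders"  # hypothetical table location
write_opts = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "status",
    # COPY_ON_WRITE rewrites base files on every update; MERGE_ON_READ appends updates
    # to log files and relies on compaction, trading write amplification for read work.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
}

orders = spark.createDataFrame(
    [("o1", "new", 100), ("o2", "shipped", 101)],
    ["order_id", "status", "updated_at"],
)
orders.write.format("hudi").options(**write_opts).mode("overwrite").save(base_path)

# Incremental query: pull only records committed after a checkpointed instant,
# instead of rescanning the whole table on every ETL run.
incr_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "000",  # "000" = everything since the first commit
}
spark.read.format("hudi").options(**incr_opts).load(base_path).show()
```

In a real pipeline, the begin-instant would come from a stored checkpoint (the last commit already processed) rather than the hard-coded value shown here.
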
  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Storage Engine in Database, Data Warehouse & Lakehouse.

    The storage engine is a comparatively less talked-about component of a database management system, but it is actually the backbone of any database (OLTP/OLAP), responsible for how data is stored, retrieved, and managed on disk. Some of its critical capabilities include:

    ⭐️ ACID Transactions: ensures atomicity, consistency, isolation & durability of transactions.
    ⭐️ Concurrency Control: manages simultaneous data access/updates to ensure data integrity.
    ⭐️ Locking: implements mechanisms to handle multiple transactions without conflicts.
    ⭐️ Indexing: provides efficient data retrieval mechanisms for query performance.
    ⭐️ Clustering: optimizes storage layout for better access patterns.
    ⭐️ Cleaning: maintains and cleans up data to ensure efficient storage management.

    If you have worked with relational databases such as MySQL, you are probably aware of two of the widely used storage engines - InnoDB & MyISAM. Coming to the OLAP world, data warehouses also have their native storage engines that handle things like concurrency control, ACID transactions, and locking, among others. A recent paper by Google describes a new storage engine for BigQuery called "Vortex" (link in comments) that is designed particularly for real-time analytics.

    So, what's happening in a lakehouse? A lakehouse is always defined as "best of (warehouse + lake)", i.e. we get the scalability & cost benefits of data lakes and the transactional capabilities of a warehouse. The storage engine plays a major role in this.

    ✅ The storage engine is a critical component in a lakehouse architecture that helps optimally organize data in cloud data lakes such as S3, GCS, and Azure.
    ✅ Lakehouse platforms such as Apache Hudi have their own storage engine that handles clustering, compaction, cleaning, and concurrency control, which is very important.
    ✅ Open table format (metadata) + storage engine = transactional database layer.

    All of this is what brings consistency guarantees to transactions in a lakehouse - for example, two engines writing to the same table at the same time (concurrent writes). The storage engine is responsible for organizing the data & keeping all the files and data structures up to date.

    #dataengineering #softwareengineering

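For a concrete feel of how these storage-engine duties surface in Hudi, here is a small, hedged sketch of the table-service options (cleaning, compaction, clustering, concurrency control) that can be set alongside an ordinary Spark write. The table name and values are illustrative assumptions, not tuned recommendations, and option names should be checked against the Hudi version in use.

```python
# Table-service settings expressed as write options on a Hudi table.
storage_engine_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Cleaning: bound how many older file versions are retained on storage.
    "hoodie.cleaner.commits.retained": "10",
    # Compaction (MoR): fold row-based log files into columnar base files periodically.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: rewrite/sort small files to improve the storage layout for queries.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "event_type",
    # Concurrency control: optimistic locking for multiple concurrent writers
    # (a lock provider such as ZooKeeper or DynamoDB must also be configured).
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
}

# Applied like any other Hudi write, e.g.:
# events_df.write.format("hudi").options(**storage_engine_opts).mode("append").save(base_path)
```

The point of the sketch is simply that clustering, compaction, cleaning, and concurrency control are first-class knobs of the table itself, not external jobs bolted onto a bare file format.
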
  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Data ingestion to a Lakehouse.

    There are numerous ways to ingest data from a variety of sources into open #lakehouse platforms like Apache Hudi - for example, Spark and Flink jobs. However, Hudi's platform also natively offers a utility called "Hudi Streamer" for efficient data ingestion.

    𝐇𝐞𝐫𝐞 𝐚𝐫𝐞 5 𝐭𝐡𝐢𝐧𝐠𝐬 𝐭𝐨 𝐤𝐧𝐨𝐰 𝐚𝐛𝐨𝐮𝐭 𝐇𝐮𝐝𝐢 𝐒𝐭𝐫𝐞𝐚𝐦𝐞𝐫:

    ✅ It enables you to ingest data from sources such as distributed file systems, Kafka, Pulsar, JDBC, etc.
    ✅ Hudi Streamer is part of the 'platform services' that sit atop table services (such as clustering and file sizing), interfacing with writers & readers.
    ✅ It supports functionality like automatic checkpoint management, integration with schema registries such as Confluent, and deduplication of data.
    ✅ There are also features for backfills, one-off runs, & continuous-mode operation with Spark/Flink streaming writers.
    ✅ It allows creating both Copy-on-Write (CoW) & Merge-on-Read (MoR) tables.

    These types of platform services really make Hudi unique compared to other metadata table formats. Hudi Streamer is also one of the most widely used utilities that comes natively with the 'table format', along with other transactional capabilities.

    Linked a great blog (in comments) from Amazon Web Services (AWS) on how to use Hudi Streamer with AWS Glue to ingest streaming data from Amazon MSK.

    #dataengineering #softwareengineering

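Below is a rough sketch of how Hudi Streamer is typically launched (it runs as a Spark application via spark-submit), wrapped in Python only for illustration. The bundle version, jar path, properties file, and field names are hypothetical placeholders; verify the class name and flags against the Hudi release you actually run, since the utility was renamed from HoodieDeltaStreamer to HoodieStreamer in recent versions.

```python
import subprocess

# Launch Hudi Streamer to continuously ingest a Kafka topic into a MoR table.
cmd = [
    "spark-submit",
    "--class", "org.apache.hudi.utilities.streamer.HoodieStreamer",
    "hudi-utilities-bundle_2.12-0.14.1.jar",           # placeholder bundle jar
    "--table-type", "MERGE_ON_READ",
    "--op", "UPSERT",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--source-ordering-field", "ts",                   # field used to pick the latest record
    "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
    "--target-base-path", "s3://bucket/hudi/orders",   # placeholder table location
    "--target-table", "orders",
    "--props", "kafka-source.properties",              # brokers, topic, schema files, key generator
    "--continuous",                                    # keep ingesting in a loop with async table services
]
subprocess.run(cmd, check=True)
```

Dropping the `--continuous` flag gives the one-off/backfill style run mentioned in the post, with checkpoints carried in the Hudi commit metadata between runs.
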
  • View organization page for Apache Hudi

    🚨 Upcoming Hudi Community Sync 🚨

    Join us on 20th August 2024 for the next community sync. In this session, we will have Ankit Shrivastava - Senior Software Engineer at Uber - presenting how they leverage Apache Hudi to tackle the challenges of read & write amplification in big data. Ankit will elaborate on how Hudi can be used to build low-latency solutions for complex data workflows involving complex transformations.

    👉 Link: https://lnkd.in/eeujPjmn
    🗓️ Date/Time: 20th August, 9 AM PT | 12 PM ET

    #dataengineering #softwareengineering

  • Apache Hudi reposted this

    View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

    Tuning Spark jobs with Apache Hudi.

    Running distributed ETL jobs with compute engines such as Apache Spark requires you to keep tabs on two factors:
    - performance
    - reliability

    If not tuned properly, these Spark jobs can run for hours, consuming a lot of computing resources. This inefficiency can also significantly impact downstream applications with wrong/partial data writes. Effectively tuning the configurations can help manage memory usage and prevent out-of-memory errors. This enhances overall job stability, making your data processing pipelines more robust.

    When Spark is used with #lakehouse platforms like Apache Hudi, the general rules of debugging apply as well. Here is a list of items to keep in mind in order to ensure better performance and reliability. You can find the exact configs in the attached image.

    ✅ Input Parallelism: Hudi follows Spark's parallelism settings for processing input data. If the input data size is large, increasing the input parallelism can help avoid performance bottlenecks caused by shuffle operations.
    ✅ Off-Heap Memory: Writing #Parquet files requires substantial off-heap memory (memory allocated outside of the JVM heap). This is especially important when dealing with wide schemas.
    ✅ Spark Memory: Hudi operations, such as merges or compactions, need enough memory to read and process individual files. Additionally, Hudi caches input data to optimize data placement, which requires memory.
    ✅ Sizing Files: The target size of the files written by Hudi should be set carefully to balance write performance against the number of files generated. Smaller files can reduce write latency but may increase metadata overhead.
    ✅ Timeseries/Log Data: For timeseries, event, or log data, which usually involve smaller but more numerous records, tuning the bloom filter or using a bucketed index is recommended.

    A lot of these recommendations also apply to other formats like Apache Iceberg & Delta Lake.

    Have you worked with Spark & lakehouse formats and benefited from any other Spark-related configs? Share in the comments!

    #dataengineering #softwareengineering

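To tie the checklist above to concrete settings, here is a hedged PySpark sketch of where these knobs typically live. The values are illustrative starting points only (the post's exact configs are in the attached image), and both the Spark and Hudi option names should be checked against the versions you run.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-tuning-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Spark memory / off-heap: leave headroom outside the JVM heap for Parquet writes.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)

hudi_tuning_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    # Input parallelism: raise these when large inputs make shuffles the bottleneck.
    "hoodie.upsert.shuffle.parallelism": "400",
    "hoodie.insert.shuffle.parallelism": "400",
    # File sizing: balance write latency against file-count and metadata overhead.
    "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # Timeseries/log data: a bucket index can avoid per-record bloom-filter lookups.
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "64",
}

# Applied on the write, e.g.:
# events_df.write.format("hudi").options(**hudi_tuning_opts).mode("append").save(base_path)
```

None of these numbers are recommendations; they are placeholders to show which layer each tuning item from the post belongs to (Spark session configs vs. per-table Hudi write options).
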
  • View organization page for Apache Hudi

    In this talk, Ankit, a Senior Software Engineer at Uber, will present how they leverage Apache Hudi to tackle the challenges of read and write amplification in big data. Discover how they handle late-arriving data and upserts, especially with massive volumes and complex transformations. Learn about their sophisticated data modeling and functional rule engine-based transformation strategy, which together have enabled the creation of a robust, low-latency solution. Join us on August 20th, Tuesday at 9 AM Pacific Time to learn how Apache Hudi can be used to build low latency solutions for complex data workflows involving complex transformations.

    Scaling Complex Data Workflows using Apache Hudi

    www.linkedin.com

  • Apache Hudi reposted this

    View profile for Sivabalan Narayanan

    Onehouse, Apache Hudi, Ex-Uber, Ex-Linkedin - We are Hiring

    Apache Hudi's Deltastreamer (renamed to HoodieStreamer) is an incredible tool for continuously ingesting data from your sources into Hudi tables. Lots of blogs, articles, and use-cases have already been written on this. This is my 10,000 ft view of the tool, its purpose, and its benefits.

    We have always envisioned Apache Hudi to be a platform and not a mere table format. You would not believe it if I told you: we built this ingestion utility way back in 2017, even with the 0.5.0 release of Apache Hudi. It can do auto ingestion in a continuous loop, manages checkpoints and ensures data integrity on failures, does schema management, provides transformation capabilities, auto and async compaction, auto and async clustering, and much more.

    Can someone point to a similar utility for other table formats in this space? (I know LakeFlow from DB is the newest kid on the block, but it's pretty new and is proprietary.) It begs the question whether we are truly looking to innovate and work toward community needs in open source, or whether the vendor ecosystem and paid services play a hand, so such needs of OSS users never see the light of day. Or it could be that these are left to be bare-minimum table formats in open source, all such platform components are left proprietary, and open source users are either pushed to build their own or pay for these services.

    Anyways, we (Apache Hudi) are happy our users are able to leverage such a tool to build their data architecture and fulfill their needs and use-cases.

    https://lnkd.in/gpdziw3e

    Linking a few community posts using HoodieStreamer:
    https://lnkd.in/g5BFQGbW
    https://lnkd.in/gJBj32Aa
    https://lnkd.in/gGWdmBq4
    https://lnkd.in/g6hq2Axe
    https://lnkd.in/gTRYeVBC

    Apache Hudi Deltastreamer #Datalake #ApacheHudi #DeltaStreamer #IncrementalIngest #CDC

    medium.com
