Apache XTable (Incubating)

Data Infrastructure and Analytics

Menlo Park, CA 4,766 followers

Seamless cross-table interop between Apache Hudi, Delta Lake, and Apache Iceberg

About us

Apache XTable (Incubating) is a cross-table, omni-directional interop layer for the lakehouse table formats Apache Hudi, Apache Iceberg, and Delta Lake. XTable was formerly known as OneTable and was recently renamed. XTable is NOT a new or separate format; it provides abstractions and tools for translating lakehouse table format metadata. Choosing a table format is a costly evaluation. Each project has rich features that may fit different use cases. Some vendors use a table format as a point of lock-in. Your data should be UNIVERSAL! https://github.com/apache/incubator-xtable

Website
https://xtable.apache.org
Industry
Data Infrastructure and Analytics
Company size
11-50 employees
Headquarters
Menlo Park, CA
Type
Partnership
Founded
2023
Specialties
Data Lakehouse, Data Engineering, Lakehouse, Apache Iceberg, Apache Hudi, Delta Lake, Apache Spark, Trino, Apache Flink, and Presto

Updates

  • Apache XTable (Incubating) reposted this

View profile for Lokesh Venkenddini

Solutions Architect | Senior Data Engineer | Databricks | Azure | Ryerson University alumni | Entrepreneur

Got some time to look at XTable this weekend and here is a short summary.

    🌟 Apache XTable: A New Era of Interoperability Between Open Table Formats 🌟

    What is Apache XTable?
    Apache XTable is an innovative tool designed for seamless interoperability between lakehouse table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. It enables users to write data in any format and convert it to multiple target formats without data duplication. Apache XTable simplifies the complexities of data management in a lakehouse environment, making it a valuable addition for businesses leveraging big data analytics.

    Key Features:
    - Real-time Replication: Achieve transparent and real-time replication in any direction.
    - Accurate and Lossless: Ensure an accurate and lossless model for your data.
    - Extensibility: Designed to be flexible and extensible to support future formats and versions.
    - Community-driven: Built by a neutral and inclusive community of vendors, cloud providers, and users.

    How It Works:
    1. Setup and Configuration:
       - Download and install Apache XTable from the official repository.
       - Create a configuration file (e.g., `datasetConfig.yaml`) specifying your source and target formats:

         sourceFormat: delta
         targetFormats:
           - iceberg
         sourcePath: s3://my-bucket/delta-table

    2. Running XTable:
       - Execute XTable using the Java binary, pointing to your configuration file:

         java -jar xtable.jar --config datasetConfig.yaml

    3. Data Synchronization:
       - XTable supports incremental and full synchronization modes. Incremental mode is preferred for efficiency, syncing only new commits from the source.
    4. Metadata Management:
       - XTable maintains metadata for the target formats, ensuring schema updates and statistics are accurately reflected.
    5. Integration:
       - XTable can be integrated into data pipelines, such as Apache Airflow, allowing automated data processing.

    Benefits:
    - High Throughput: Handles large volumes of structured data efficiently.
    - Flexibility: Supports multiple formats, enhancing data accessibility.
    - Cost-Effective: Avoids data duplication, reducing storage costs.

    Current Status:
    - Supported formats include Apache Hudi, Apache Iceberg, and Delta Lake.
    - Compatible with platforms like Apache Spark, Trino, Microsoft Fabric, Databricks, BigQuery, Snowflake, and Redshift.
    - Features like on-demand incremental conversion, copy-on-write, and catalog integration are already in place.

    #ApacheXTable #DataInteroperability #DataLakehouse #OpenSource #BigData #DataManagement #JoinTheRevolution
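    For reference, a fuller dataset config in the shape the XTable docs describe, plus the bundled-utilities run command. This is a minimal sketch: the key names follow the documented layout, but the jar name, bucket, and table name are placeholders that vary by release.

        # datasetConfig.yaml -- illustrative layout; check the docs of your XTable release
        sourceFormat: DELTA              # HUDI, DELTA, or ICEBERG
        targetFormats:
          - ICEBERG                      # one or more target formats
        datasets:
          - tableBasePath: s3://my-bucket/delta-table   # root path of the source table (placeholder)
            tableName: my_table                         # placeholder table name

        # Run the metadata sync (jar name is a placeholder for the bundled utilities jar)
        java -jar xtable-utilities-bundled.jar --datasetConfig datasetConfig.yaml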

  • Apache XTable (Incubating) reposted this

View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

Breaking down Apache XTable's Architecture.

    Apache XTable (Incubating) is an omni-directional translation layer on top of open table formats such as Apache Hudi, Apache Iceberg & Delta Lake. It is NOT ❌ a new table format! Essentially what we are doing is this:

    SOURCE ---> (read metadata) ---> XTable's Model ---> write into TARGET

    We read the metadata from the SOURCE table format, put it into a unified representation & write the metadata out in the TARGET format. * Note that XTable only touches metadata, not the actual data files (such as #Parquet).

    Let's break down its architecture. XTable's architecture consists of three key components:

    1. Conversion Source:
    ✅ These are table-format-specific modules responsible for reading metadata from the source
    ✅ They extract information like schema, transactions & partitions and translate it into XTable's unified internal representation

    2. Conversion Logic:
    ✅ This is the central processing unit of XTable
    ✅ It orchestrates the entire translation process, including initializing all components and managing sources and targets, among other critical things

    3. Conversion Target:
    ✅ These mirror the source readers
    ✅ They take the internal representation of the metadata & map it to the target format's metadata structure

    A code sketch of this flow follows below. Blog in comments for a detailed read. #dataengineering #softwareengineering
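    To make the three components concrete, here is a minimal Java sketch of the source -> model -> target flow. All names here (InternalTable, ConversionSource, ConversionTarget, ConversionLogic) are hypothetical illustrations of the architecture described above, not XTable's actual API.

        // Hypothetical sketch of XTable's three-component architecture (not the real API).
        import java.util.List;

        // Unified internal representation: schema, partition fields, and committed data files.
        record InternalTable(String schemaJson, List<String> partitionFields, List<String> dataFiles) {}

        // Conversion Source: a format-specific reader (Hudi, Iceberg, or Delta).
        interface ConversionSource {
            InternalTable readMetadata();   // extract schema, transactions, partitions
        }

        // Conversion Target: mirrors the source readers; maps the model to target metadata.
        interface ConversionTarget {
            void writeMetadata(InternalTable table);
        }

        // Conversion Logic: the central orchestrator of the translation process.
        class ConversionLogic {
            void sync(ConversionSource source, List<ConversionTarget> targets) {
                InternalTable model = source.readMetadata();    // SOURCE ---> XTable's Model
                for (ConversionTarget target : targets) {
                    target.writeMetadata(model);                // Model ---> TARGET (metadata only)
                }
            }
        }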

  • Apache XTable provides users with the ability to translate metadata from one #lakehouse table format to another, omni-directionally. What exactly happens after the XTable "Sync" process is run? The sync provides the following:

    ✅ Syncs the data files along with their column-level statistics and partition metadata
    ✅ All schema-level updates in the source table are reflected in the target format's metadata
    ✅ Metadata maintenance for the target table format (the native retention knobs behind these behaviors are sketched below):
    - If the target format is Apache Hudi, unreferenced files will be marked as ‘cleaned’ to control metadata table size
    - If the target format is Apache Iceberg, snapshots will be expired after a configured amount of time
    - If the target format is Delta Lake, the transaction log will be retained for a configured amount of time

    ⭐️ Want to try out XTable? Here is the link to the getting started page: https://lnkd.in/gHMBQeqV

    #dataengineering #softwareengineering
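    For context, the maintenance behaviors above map to retention knobs that each format exposes natively. The properties below are the formats' own (values shown are their documented defaults); how a given XTable release surfaces or overrides them is version-dependent, so treat this as background rather than XTable configuration:

        # Apache Hudi: cleaner and archival retention
        hoodie.cleaner.commits.retained=10
        hoodie.keep.min.commits=20
        hoodie.keep.max.commits=30

        # Apache Iceberg: max snapshot age before expiry (5 days, in ms)
        history.expire.max-snapshot-age-ms=432000000

        # Delta Lake: transaction log retention window
        delta.logRetentionDuration=interval 30 days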

  • Apache XTable (Incubating) reposted this

View profile for Thomas Hass

👨‍💻 Cloud Data & AI Engineer / Architect | 8x AWS certified

✨ Easily Switch Between Iceberg, Hudi and Delta in Your Data Platforms using Apache XTable

    Modern data platforms bet on open table formats to gain data warehouse-like behaviors, such as ACID transactions, on their Parquet files in cheap cloud object stores. There are three leading table formats: Apache Iceberg, Apache Hudi and Delta Lake. The general concept of these table formats is very similar, as they all provide a metadata layer on top of the data. Some use cases, as well as some query engines and vendors, favor one table format over the others. This becomes a challenge in large organizations when different teams build their solutions on different table formats.

    This is where Apache XTable (Incubating) comes into play: XTable solves this by providing a converter from any one of the three table formats to any other (without touching the actual data).

    As there are not many practical demos showing how this could work, I have uploaded a YouTube video with an end-to-end guide where we start by generating data in Hudi using AWS Glue, transform it to Iceberg and Delta with XTable, and then read it with Snowflake (Iceberg) and Databricks (Delta). A sketch of the conversion config for that pipeline follows below. Check out the video and let me know what you think!

    🔗 https://lnkd.in/d7cfVpj5

    #Delta #Iceberg #Hudi #XTable
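    A minimal sketch of what the XTable dataset config for that Hudi-to-Iceberg-and-Delta step could look like. The bucket, path, and table name are placeholders, and exact keys depend on the XTable release:

        sourceFormat: HUDI
        targetFormats:
          - ICEBERG
          - DELTA
        datasets:
          - tableBasePath: s3://my-bucket/hudi/orders   # placeholder path
            tableName: orders                           # placeholder table name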

    • Architecture Diagram Interoperable Lakehouse
  • Apache XTable (Incubating) reposted this

View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

Lakehouse Table Format interoperability in Databricks.

    🎯 Scenario: Users in a Databricks environment use Delta Lake to build ML workflows. However, there are certain scenarios when they need to access datasets stored in other formats, like Apache Hudi & Apache Iceberg, used by other teams to build robust ML models. So they should have an easy way to read data in any table format without having to configure other formats or any other dependencies.

    ✅ This is where Apache XTable (Incubating) comes into the picture. XTable does a lightweight metadata translation that makes reading any table format as if it were a Delta Lake table. Like:

        spark.read.format("delta").load("/mnt/mydata/<table_name>")

    Read all about the use case and the how-to in my blog (link in comments) #dataengineering #softwareengineering

  • Apache XTable (Incubating) reposted this

View profile for Dipankar Mazumdar, M.Sc 🥑

    Staff Data Engineering Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Distributed Systems | Technical Author

Apache XTable - What’s, Why’s and How’s?

    Apache Hudi, Apache Iceberg & Delta Lake provide a table-like abstraction on top of native file formats like #Parquet. They serve as a metadata layer and offer the necessary primitives for compute engines to interact with storage.

    While these formats have enabled organizations to store data in an independent open tier, decoupled from compute, the decision to select and stick to one particular format is challenging. The question is — ‘Is there another way?’

    I jotted it all down in the blog linked in comments. #dataengineering #softwareengineering

  • Apache XTable (Incubating) reposted this

View organization page for Data Council

    5,636 followers

    "One table to bridge them all!" 🧙♂️✨ Ever wish you could seamlessly switch between Hudi, Delta Lake and Iceberg? The open-source project XTable makes it possible. Kyle Weller, Head of Product at Onehouse, took the stage at Data Council '24 to showcase how XTable works. This innovative project enables seamless interoperability across these major data lake formats, making your data management more efficient and versatile. Kyle's session includes: - A live demo of XTable in action - Real-world applications across Spark, Presto, Trino and Flink - Insights into the strengths and weaknesses of Hudi, Delta and Iceberg Watch the full session at the link in the comments.

  • Apache XTable (Incubating) reposted this

View profile for Georgi Mullassery

Martech | Director - Analytics at IPG | Ex-IBMer

Migrating from Iceberg to Delta for Efficient Incremental Order Processing
    ----------------------

    An e-commerce company stores customer order data in an Apache Iceberg table on #AWSGlueCatalog. This data is critical for order fulfillment, inventory management, and customer insights. The company processes a high volume of orders daily and needs an efficient system to handle updates.

    Initial Setup: The data team creates an Iceberg table in Glue Catalog to store the order data.

    Migration to Delta: To streamline incremental order processing, the team uses X-Table to convert the Iceberg table to a Delta table. This initial conversion creates a Delta table over the same underlying data as the Iceberg table.

    Loading New Orders: Daily, new orders are placed on the e-commerce platform. The data team continues to load this new data into the original Iceberg table.

    Incremental Order Processing with Delta: Because the Delta table created by X-Table shares the Iceberg table's underlying data files, each subsequent X-Table sync identifies and translates only the incremental changes (new or updated orders) into the Delta table's metadata.

    This approach offers several benefits (a sketch of the sync config follows below):

    - Simplified Data Pipeline: The data team can maintain their existing Iceberg-based pipeline for loading new orders.
    - Efficient Incremental Processing: X-Table syncs only incremental changes to Delta, optimizing processing time and the costs associated with big data.
    - Unlocking Delta Lake Features: The team can now leverage Delta Lake functionality like ACID transactions, data versioning, and efficient data repairs for better data management and analytics, ensuring data accuracy for critical business decisions.
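    A minimal sketch of the X-Table dataset config for this Iceberg-to-Delta sync (bucket, path, and table name are placeholders; exact keys depend on the release). Re-running the sync on a schedule, e.g. after each daily load, picks up only the new Iceberg commits:

        sourceFormat: ICEBERG
        targetFormats:
          - DELTA
        datasets:
          - tableBasePath: s3://my-bucket/orders_iceberg   # placeholder path
            tableName: orders                              # placeholder table name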
