skip to main content
research-article

Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine

Published: 09 June 2024 Publication History
  • Get Citation Alerts
  • Abstract

    Apache Arrow DataFusion is a fast, embeddable, and extensible query engine written in Rust that uses Apache Arrow as its memory model. In this paper we describe the technologies on which it is built, and how it fits in long-term database implementation trends. We then enumerate its features, optimizations, architecture and extension APIs to illustrate the breadth of requirements of modern OLAP engines as well as the interfaces needed by systems built with them. Finally, we demonstrate open standards and extensible design do not preclude state-of-the-art performance using a series of experimental comparisons to DuckDB.
    While the individual techniques used in DataFusion have been previously described many times, it differs from other industrial strength engines by providing competitive performance and an open architecture that can be customized using more than 10 major extension APIs. This flexibility has led to use in many commercial and open source databases, machine learning pipelines, and other data-intensive systems. We anticipate that the accessibility and versatility of DataFusion, along with its competitive performance, will further the proliferation of high-performance custom data infrastructures tailored to specific needs assembled from modular components. While the individual techniques used in DataFusion have been previously described many times, it differs from other industrial strength engines by providing competitive performance and an open architecture that can be customized using more than 10 major extension APIs. This flexibility has led to use in many commercial and open source databases, machine learning pipelines, and other data-intensive systems. We anticipate that the accessibility and versatility of DataFusion, along with its competitive performance, will further the proliferation of high-performance custom data infrastructures tailored to specific needs assembled from modular components.

    References

    [1]
    Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and Samuel Madden. 2007. Materialization Strategies in a Column-Oriented DBMS. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15--20, 2007, Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis (Eds.). IEEE Computer Society, 466--475. https://doi. org/10.1109/ICDE.2007.367892
    [2]
    Mustafa Akur and Mehmet Ozan Kabak. 2023. Running Windowing Queries in Stream Processing. https://www.synnada.ai/blog/running-window-query-instream- processing
    [3]
    Michael Armbrust, Tathagata Das, Sameer Paranjpye, Reynold Xin, Shixiong Zhu, Ali Ghodsi, Burak Yavuz, Mukul Murthy, Joseph Torres, Liwen Sun, Peter A. Boncz, Mostafa Mokhtar, Herman Van Hovell, Adrian Ionescu, Alicja Luszczak, Michal Switakowski, Takuya Ueshin, Xiao Li, Michal Szafranski, Pieter Senster, and Matei Zaharia. 2020. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. 13, 12 (2020), 3411--3424. https: //doi.org/10.14778/3415478.3415560
    [4]
    Arroyo. 2023. Arroyo - Serverless Stream Processing. https://www.arroyo.dev/
    [5]
    The Blaze Authors. 2023. The Blaze accelerator for Apache Spark. https://github. com/blaze-init/blaze
    [6]
    Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 221--230. https://doi.org/10.1145/3183713.3190662
    [7]
    Alexander Behm, Shoumik Palkar, Utkarsh Agarwal, Timothy Armstrong, David Cashman, Ankur Dave, Todd Greenstein, Shant Hovsepian, Ryan Johnson, Arvind Sai Krishnan, Paul Leventis, Ala Luszczak, Prashanth Menon, Mostafa Mokhtar, Gene Pang, Sameer Paranjpye, Greg Rahn, Bart Samwel, Tom van Bussel, Herman Van Hovell, Maryann Xue, Reynold Xin, and Matei Zaharia. 2022. Photon: A Fast Query Engine for Lakehouse Systems. In SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G. Ives, Angela Bonifati, and Amr El Abbadi (Eds.). ACM, 2326--2339. https://doi.org/10.1145/3514221.3526054
    [8]
    Tyler A. Cabutto, Sean P. Heeney, Shaun V. Ault, Guifen Mao, and Jin Wang. 2018. An Overview of the Julia Programming Language. In Proceedings of the 2018 International Conference on Computing and Big Data (Charleston, SC, USA) (ICCBD '18). Association for Computing Machinery, New York, NY, USA, 87--91. https://doi.org/10.1145/3277104.3277119
    [9]
    Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O'Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, and Günther Hagleitner. 2019. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 1773--1786. https://doi.org/10.1145/3299869.3314045
    [10]
    Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10--14, 2000, Cairo, Egypt, Amr El Abbadi, Michael L. Brodie, Sharma Chakravarthy, Umeshwar Dayal, Nabil Kamel, Gunter Schlageter, and Kyu-Young Whang (Eds.). Morgan Kaufmann, 1--10. http://www.vldb.org/conf/2000/P001.pdf
    [11]
    The Apache Arrow DataFusion Comet. 2024. The Comet accelerator for Apache Spark. https://github.com/apache/arrow-datafusion-comet
    [12]
    The Rust community. 2023. Cargo: Rust's built-in package manager. https: //crates.io/
    [13]
    Coralogix. 2023. Coralogix - Full-Stack Observability Platform with In-Stream Data Analytics. https://coralogix.com
    [14]
    IBM Corporation. 2023. IBM DB2. https://www.ibm.com/products/db2
    [15]
    Microsoft Corporation. 2023. Microsoft SQL Server. https://www.microsoft.com/ en-us/sql-server
    [16]
    Oracle Corporation. 2023. The Oracle Database Server. https://www.oracle.com/ database/
    [17]
    The Transaction Processing Council. 2023. The TPC-H Benchmark. https://www. tpc.org/tpch/
    [18]
    Voltron Data. 2023. The Composable Codex. https://voltrondata.com/codex
    [19]
    Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. (2004), 137--150. http://www.usenix.org/events/osdi04/tech/ dean.html
    [20]
    Delta-rs. 2024. A native Rust library for Delta Lake. https://github.com/deltaio/ delta-rs
    [21]
    Arrow developers. 2023. Mailing list: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format. https://lists.apache.org/thread/ r28rw5n39jwtvn08oljl09d4q2c1ysvb
    [22]
    Substrait Developers. 2023. Substrait: Cross-Language Serialization for Relational Algebra. https://substrait.io/
    [23]
    David J. DeWitt and Michael Stonebraker. 2008. MapReduce: A major step backwards. https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_ backwards.html
    [24]
    Apache Software Foundation. 2023. Apache Arrow. https://arrow.apache.org
    [25]
    Apache Software Foundation. 2023. Apache Arrow DataFusion. https://arrow. apache.org/datafusion/
    [26]
    Apache Software Foundation. 2023. Apache Parquet. https://parquet.apache.org
    [27]
    Apache Software Foundation. 2023. A Primer on ASF Governance. https://www. apache.org/foundation/governance/
    [28]
    Apache Software Foundation. 2023. PyArrow - Apache Arrow Python bindings. https://arrow.apache.org/docs/python/index.html
    [29]
    The Apache Software Foundation. 2023. Apache DataFusion SQL reference. https: //arrow.apache.org/datafusion/user-guide/sql/index.html
    [30]
    The Apache Software Foundation. 2024. Apache Iceberg: The open table format for analytic datasets. https://iceberg.apache.org/
    [31]
    The Zig Software Foundation. 2023. The Zig programming language. https: //ziglang.org/
    [32]
    Rust futures crate. 2023. Stream trait. https://docs.rs/futures/0.3.28/futures/ prelude/stream/trait.Stream.html
    [33]
    Goetz Graefe. 1990. Encapsulation of Parallelism in the Volcano Query Processing System. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, USA, May 23--25, 1990, Hector Garcia- Molina and H. V. Jagadish (Eds.). ACM Press, 102--111. https://doi.org/10.1145/ 93597.98720
    [34]
    Goetz Graefe. 2006. Implementing sorting in database systems. ACM Comput. Surv. 38, 3 (2006), 10. https://doi.org/10.1145/1132960.1132964
    [35]
    H2O.ai. 2023. Database-like ops benchmark. https://h2oai.github.io/dbbenchmark/
    [36]
    Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two Decades of Research in Columnoriented Database Architectures. IEEE Data Eng. Bull. 35, 1 (2012), 40--45. http: //sites.computer.org/debull/A12mar/monetdb.pdf
    [37]
    Apple Inc. 2023. The Swift programming language. https://developer.apple.com/ swift/
    [38]
    ClickHouse Inc. 2023. ClickBench - a Benchmark For Analytical DBMS. https: //benchmark.clickhouse.com/
    [39]
    InfluxData Inc. 2023. Announcing InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow. https://www.influxdata.com/blog/announcing-influxdbiox/
    [40]
    InfluxData Inc. 2023. The Influx Query Language Specification. https://github. com/influxdata/influxql
    [41]
    InfluxData Inc. 2023. InfluxDB - open source time series, metrics, and analytics database. https://influxdata.com/
    [42]
    ISO/IEC 9075:2023 2023. Information technology - Database languages - SQL. Standard. International Organization for Standardization.
    [43]
    Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter A. Boncz. 2018. Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. Proc. VLDB Endow. 11, 13 (2018), 2209--2222. https://doi.org/10.14778/3275366.3275370
    [44]
    Amandeep Khurana and Julien Le Dem. 2018. The Modern Data Architecture: The Deconstructed Database. login Usenix Mag. 43, 4 (2018). https://www.usenix. org/publications/login/winter-2018-vol-43-no-4/khurana
    [45]
    DuckDB Labs. 2023. Parallel Grouped Aggregation in DuckDB. https://duckdb. org/2022/03/07/aggregate-hashtable.html
    [46]
    Andrew Lamb. 2022. Using Rustlang's Async Tokio Runtime for CPU-Bound Tasks. https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpubound- tasks/
    [47]
    Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. 2012. The Vertica Analytic Database: C-Store 7 Years Later. Proc. VLDB Endow. 5, 12 (2012), 1790--1801. https://doi.org/10.14778/ 2367502.2367518
    [48]
    Andrew Lamb and Raphael Taylor-Davies. 2022. Querying Parquet with Millisecond Latency. https://arrow.apache.org/blog/2022/12/26/querying-parquet-withmillisecond- latency/
    [49]
    Andrew Lamb and Raphael Taylor-Davies. 2023. Fast and Memory Efficient Multi- Column Sorts in Apache Arrow Rust. https://arrow.apache.org/blog/2022/11/07/ multi-column-sorts-in-arrow-rust-part-1/
    [50]
    Andrew Lamb, Raphael Taylor-Davies, and Daniël Heres. 2023. Aggregating Millions of Groups Fast in Apache Arrow DataFusion. https://www.influxdata. com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion
    [51]
    Lance. 2023. Lance: modern columnar data format for ML. https://lancedb.github. io/lance/
    [52]
    Chris Lattner and Vikram S. Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 20--24 March 2004, San Jose, CA, USA. IEEE Computer Society, 75--88. https://doi.org/10.1109/CGO. 2004.1281665
    [53]
    Viktor Leis, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2014. Morseldriven parallelism: a NUMA-aware query evaluation framework for the manycore age. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22--27, 2014, Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu (Eds.). ACM, 743--754. https://doi.org/10.1145/2588555.2610507
    [54]
    Viktor Leis, Kan Kundhikanjana, Alfons Kemper, and Thomas Neumann. 2015. Efficient Processing of Window Functions in Analytical SQL Queries. Proc. VLDB Endow. 8, 10 (2015), 1058--1069. https://doi.org/10.14778/2794367.2794375
    [55]
    Jon Mease. 2023. VegaFusion: Server side scaling for the Vega visualization library. https://vegafusion.io/
    [56]
    Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. 2010. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. VLDB Endow. 3, 1 (2010), 330--339. https: //doi.org/10.14778/1920841.1920886
    [57]
    Guido Moerkotte. 1998. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. In VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24--27, 1998, New York City, New York, USA, Ashish Gupta, Oded Shmueli, and Jennifer Widom (Eds.). Morgan Kaufmann, 476--487. http://www.vldb.org/conf/1998/p476.pdf
    [58]
    Ramon E. Moore. 1966. Interval analysis. Vol. 4. Prentice-Hall Englewood Cliffs.
    [59]
    The pandas development team. 2020. pandas-dev/pandas: Pandas. https://doi. org/10.5281/zenodo.3509134
    [60]
    Pedro Pedreira, Orri Erling, Maria Basmanova, Kevin Wilfong, Laith S. Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. 2022. Velox: Meta's Unified Execution Engine. Proc. VLDB Endow. 15, 12 (2022), 3372--3384. https://doi.org/ 10.14778/3554821.3554829
    [61]
    Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satyanarayana R. Valluri, Mohamed Zaït, and Jacques Nadeau. 2023. The Composable Data Management System Manifesto. Proc. VLDB Endow. 16, 10 (2023), 2679--2685. https://doi.org/10.14778/3603581.3603604
    [62]
    PostgreSQL. 2024. The PostgreSQL Relationa Database. https://www.postgresql. org/
    [63]
    The Dask project. 2023. The dask-sql project. https://dask-sql.readthedocs.io/en/ latest/
    [64]
    The OAP project. 2023. Gluten: Plugin to Double SparkSQL's Performance. https: //h2oai.github.io/db-benchmark/
    [65]
    Mark Raasveldt, Pedro Holanda, Tim Gubner, and Hannes Mühleisen. 2018. Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing. In Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018, Alexander Böhm and Tilmann Rabl (Eds.). ACM, 2:1--2:6. https://doi.org/10.1145/3209950.3209955
    [66]
    Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: an Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 1981--1984. https://doi.org/10.1145/3299869.3320212
    [67]
    Tokio rs Developers. 2023. Tokio: As asynchronous Rust runtime. https://tokio.rs/
    [68]
    SDF. 2023. SDF. https://www.sdf.com/engine
    [69]
    Seafowl. 2024. Seafowl Postgres Accelerator. https://seafowl.io/
    [70]
    Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek Bharathan, and Chuck Bear. 2013. Materialization strategies in the Vertica analytic database: Lessons learned. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8--12, 2013, Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou (Eds.). IEEE Computer Society, 1196--1207. https://doi.org/10.1109/ICDE.2013.6544909
    [71]
    Moritz Sichert and Thomas Neumann. 2022. User-Defined Operators: Efficiently Integrating Custom Algorithms into Modern Databases. Proc. VLDB Endow. 15, 5 (2022), 1119--1131. https://doi.org/10.14778/3510397.3510408
    [72]
    The sqlparser-rs authors. 2023. sqlparser-rs: Extensible SQL Lexer and Parser for Rust. https://github.com/sqlparser-rs/sqlparser-rs
    [73]
    Michael Stonebraker. 2008. Technical perspective - One size fits all: an idea whose time has come and gone. Commun. ACM 51, 12 (2008), 76. https://doi.org/10. 1145/1409360.1409379
    [74]
    Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005. C-Store: A Column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi (Eds.). ACM, 553--564. http://www.vldb.org/ archives/website/2005/program/paper/thu/p553-stonebraker.pdf
    [75]
    Synnada. 2023. Synnada realtime data platform. https://www.synnada.ai/
    [76]
    The Rust team. 2023. The Rust programming language. https://www.rustlang. org/
    [77]
    Matei Zaharia, Reynold S. Xin, PatrickWendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56--65. https://doi.org/10.1145/2934664

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN:9798400704222
    DOI:10.1145/3626246
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 June 2024

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. api design
    2. column stores
    3. database systems
    4. modular query engines
    5. olap
    6. parallel execution
    7. vectorized execution

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '24
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 1,366
      Total Downloads
    • Downloads (Last 12 months)1,366
    • Downloads (Last 6 weeks)309
    Reflects downloads up to 06 Aug 2024

    Other Metrics

    Citations

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media