InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in Apache Arrow
- 1. Paul Dix
InfluxData – CTO & co-founder
paul@influxdata.com
@pauldix
InfluxDB IOx - a new columnar time series database (update)
- 2. © 2021 InfluxData. All rights reserved.
API
• InfluxDB 2.x with Line Protocol
• HTTP Query with JSON, CSV, Print
• Arrow Flight
• Move over to gRPC for management (and CLI)
– Create Databases
– Start defining replication/sharding
– Readme for gRPCurl
• gRPC Health
- 3. CLI & Config
• Write Line Protocol from File
• Create Database
• Object Store parameters
- 4. Query
• Queries now work across Mutable Buffer & Read Buffer
• Data Fusion (features)
• Massive infusion of postgres string functions (lpad, rpad, ascii, chr, ltrim, etc)
• Support for EXTRACT (e.g. `EXTRACT hour from date_col`)
• Data Fusion (performance)
• Optimized function implementation for scalar values and columns
• Improved join indices, support for more advanced statistics, expression rewriting
- 5. Path to OSS Builds
• Not until we think it’s useful/interesting to test
• Dogfood our monitoring
1. In-memory 2.4M values/sec
2. Basic proxied/distributed query
3. Mutable Buffer to Read Buffer lifecycle (basic)
4. WAL Buffering/persistence
5. Subscriptions
6. Parquet Persistence
7. Recovery
• Single Server Steady State
• CLI for configuration
• Documentation
- 7. Today: IOx Team at InfluxData
Past life 1: Query Optimizer @ Vertica, also
on Oracle DB server
Past life 2: Chief Architect + VP Engineering
roles at some ML startups
- 8. Talk Outline
What is a Query Engine
Introduction to DataFusion / Apache Arrow
DataFusion Architectural Overview
- 11. Motivation
Users who want to access data without writing a program
UIs (visual and textual)
SQL is the common interface
Query Engine
Data is stored somewhere
- 12. DataFusion Use Cases
1. Data engineering / ETL:
a. Construct fast and efficient data pipelines (~ Spark)
2. Data Science
a. Prepare data for ML / other tasks (~ Pandas)
3. Database Systems:
a. E.g. IOx, Ballista, Cloudfuse Buzz, various internal systems
- 13. Why DataFusion?
High Performance: Memory (no GC) and Performance, leveraging Rust/Arrow
Easy to Connect: Interoperability with other tools via Arrow, Parquet and Flight
Easy to Embed: Can extend data sources, functions, operators
First Class Rust: High quality Query / SQL Engine entirely in Rust
High Quality: Extensive tests and integration tests with Arrow ecosystems
My goal: DataFusion to be *the* choice for any SQL support in Rust
- 14. DBMS vs Query Engine
Database Management Systems (DBMS) are full featured systems
● Storage system (stores actual data)
● Catalog (store metadata about what is in the storage system)
● Query Engine (query, and retrieve requested data)
● Access Control and Authorization (users, groups, permissions)
● Resource Management (divide resources between uses)
● Administration utilities (monitor resource usage, set policies, etc)
● Clients for Network connectivity (e.g. implement JDBC, ODBC, etc)
● Multi-node coordination and management
DataFusion: the Query Engine piece of this list
- 15. What is DataFusion?
“DataFusion is an in-memory query engine that uses Apache Arrow as the memory model” - crates.io
● In Apache Arrow github repo
● Apache licensed
● Not part of the Arrow spec, uses Arrow
● Initially implemented and donated by Andy Grove; design based on How Query Engines Work
- 17. DataFusion Extensibility 🧰
● User Defined Functions
● User Defined Aggregates
● User Defined Optimizer passes
● User Defined LogicalPlan nodes
● User Defined ExecutionPlan nodes
● User Defined TableProvider for tables
* Built in data persistence using parquet and CSV files
- 18. What is a Query Engine?
1. Frontend
a. Query Language + Parser
2. Intermediate Query Representation
a. Expression / Type system
b. Query Plan w/ Relational Operators (Data Flow Graph)
c. Rewrites / Optimizations on that graph
3. Concrete Execution Operators
a. Allocate resources (CPU, Memory, etc)
b. Push bytes around, vectorized calculations, etc
- 19. DataFusion is a Query Engine!
1. Frontend: SQLStatement
2. Intermediate Query Representation: LogicalPlan, Expr
3. Concrete Execution Operators: ExecutionPlan, RecordBatches
Rust struct
- 20. DataFusion Input / Output Diagram
Input, either a SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
OR a DataFrame:
ctx.read_table("http")?
  .filter(...)?
  .aggregate(..)?;
Plus catalog information: tables, schemas, etc
Output: RecordBatches
- 22. DataFusion CLI
> CREATE EXTERNAL TABLE
http_api_requests_total
STORED AS PARQUET
LOCATION
'http_api_requests_total.parquet';
> SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
+--------+-----------------+
| status | COUNT(UInt8(1)) |
+--------+-----------------+
| 4XX    | 73621           |
| 2XX    | 338304          |
+--------+-----------------+
- 23. EXPLAIN Plan
Gets a textual representation of LogicalPlan
> explain SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------+
| plan_type    | plan                                                     |
+--------------+----------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] |
|              |   Selection: #path Eq Utf8("/api/v2/write")              |
|              |     TableScan: http_api_requests_total projection=None   |
+--------------+----------------------------------------------------------+
- 24. Plans as DataFlow graphs
Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] (Step 3: Data is aggregated)
  Filter: #path Eq Utf8("/api/v2/write") (Step 2: Predicate is applied)
    TableScan: http_api_requests_total projection=None (Step 1: Parquet file is read)
Data flows up from the leaves to the root of the tree
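The three steps above can be sketched with plain Rust iterators. This is a toy model that uses only the standard library, not DataFusion's API: the `Row` type, the `sample_rows` data and the `count_by_status` function are all hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical row type standing in for one record of the scanned table.
pub struct Row {
    pub path: &'static str,
    pub status: &'static str,
}

/// Hypothetical sample data; in the slides the data lives in a parquet file.
pub fn sample_rows() -> Vec<Row> {
    vec![
        Row { path: "/api/v2/write", status: "2XX" },
        Row { path: "/api/v2/write", status: "4XX" },
        Row { path: "/api/v2/query", status: "2XX" },
        Row { path: "/api/v2/write", status: "2XX" },
    ]
}

/// Toy version of the plan: TableScan -> Filter -> Aggregate.
pub fn count_by_status(rows: &[Row], wanted_path: &str) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    rows.iter() // Step 1: scan (leaf of the tree)
        .filter(|row| row.path == wanted_path) // Step 2: apply the predicate
        .for_each(|row| {
            // Step 3: aggregate (root of the tree)
            *counts.entry(row.status.to_string()).or_insert(0) += 1;
        });
    counts
}
```

Calling `count_by_status(&sample_rows(), "/api/v2/write")` groups only the rows that survive the filter, which mirrors data flowing from the leaf to the root.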
- 25. More than initially meets the eye
Use EXPLAIN VERBOSE to see optimizations applied
> EXPLAIN VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+----------------------+----------------------------------------------------------------+
| plan_type            | plan                                                           |
+----------------------+----------------------------------------------------------------+
| logical_plan         | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=None         |
| projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=Some([6, 8]) |
| type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=Some([6, 8]) |
...
+----------------------+----------------------------------------------------------------+
The optimizer “pushed down” the projection, so only the status and path columns were read from the parquet file
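The idea behind projection pushdown can be sketched in a few lines of standard-library Rust. This is an illustrative toy, not DataFusion's optimizer: the function name `projection_indices` and the four-column schema used in the usage note are hypothetical (the real file evidently has more columns, since the plan above shows indices 6 and 8).

```rust
/// Returns the positions (in schema order) of the columns a query references.
/// A scan that receives these indices only has to read those columns.
pub fn projection_indices(schema: &[&str], referenced: &[&str]) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        // Keep a column only if the query mentions it somewhere.
        .filter(|(_, name)| referenced.contains(*name))
        .map(|(index, _)| index)
        .collect()
}
```

With a hypothetical schema `["time", "host", "path", "status"]` and a query that references `status` and `path`, the function yields the two indices of those columns, which plays the role of `projection=Some([...])` in the plan.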
- 27. Array + Record Batches + Schema
+--------+--------+
| status | COUNT |
+--------+--------+
| 4XX | 73621 |
| 2XX | 338304 |
| 5XX | 42 |
| 1XX | 3 |
+--------+--------+
RecordBatch 1:
  cols:
    StringArray*: 4XX, 2XX, 5XX
    UInt64Array: 73621, 338304, 42
  schema: (the Schema below)
RecordBatch 2:
  cols:
    StringArray*: 1XX
    UInt64Array: 3
  schema: (the Schema below)
Schema:
  fields[0]: “status”, Utf8
  fields[1]: “COUNT()”, UInt64
* StringArray representation is somewhat misleading as it actually has a fixed length portion and the character data in different locations
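The relationship between arrays, record batches and a schema can be modelled with a small standard-library sketch. It is a simplified stand-in for the Arrow types, not the Arrow crate itself: these `Array`, `Schema` and `RecordBatch` definitions and the `try_new` checks are assumptions made for illustration, and the string column is a plain `Vec<String>` rather than Arrow's offsets-plus-bytes layout mentioned in the footnote.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Utf8,
    UInt64,
}

/// One column of values, all of the same type.
#[derive(Debug, Clone, PartialEq)]
pub enum Array {
    Utf8(Vec<String>),
    UInt64(Vec<u64>),
}

impl Array {
    pub fn len(&self) -> usize {
        match self {
            Array::Utf8(values) => values.len(),
            Array::UInt64(values) => values.len(),
        }
    }

    pub fn data_type(&self) -> DataType {
        match self {
            Array::Utf8(_) => DataType::Utf8,
            Array::UInt64(_) => DataType::UInt64,
        }
    }
}

/// Column names and types, shared by every batch of the same result.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<(String, DataType)>,
}

/// A horizontal slice of the result: equal-length columns plus the schema.
#[derive(Debug)]
pub struct RecordBatch {
    pub schema: Arc<Schema>,
    pub columns: Vec<Array>,
}

impl RecordBatch {
    pub fn try_new(schema: Arc<Schema>, columns: Vec<Array>) -> Result<Self, String> {
        if columns.len() != schema.fields.len() {
            return Err("column count does not match schema".to_string());
        }
        for (column, (name, data_type)) in columns.iter().zip(schema.fields.iter()) {
            if column.data_type() != *data_type {
                return Err(format!("column {} has the wrong type", name));
            }
        }
        let rows = columns.first().map_or(0, Array::len);
        if columns.iter().any(|column| column.len() != rows) {
            return Err("columns differ in length".to_string());
        }
        Ok(RecordBatch { schema, columns })
    }

    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, Array::len)
    }
}
```

The four-row result above would be two such batches (three rows and one row) that both hold an `Arc` to the same `Schema`.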
- 29. DataFusion Planning Flow
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
↓ Parsing/Planning
LogicalPlan (also called “Query Plan”; PG: “Query Tree”)
↓ Optimization
ExecutionPlan (also called “Access Plan” or “Operator Tree”; PG: “Plan Tree”)
↓ Execution
RecordBatches
- 30. DataFusion Logical Plan Creation
● Declarative: Describe WHAT you want; system figures out HOW
○ Input: “SQL” text (postgres dialect)
● Procedural: Describe HOW directly
○ Input is a program to build up the plan
○ Two options:
■ Use a LogicalPlanBuilder, Rust style builder
■ DataFrame - model popularized by Pandas and Spark
- 31. SQL → LogicalPlan
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
↓ SQL Parser
Parsed Statement
Query {
    ctes: [],
    body: Select(
        Select {
            distinct: false,
            top: None,
            projection: [
                UnnamedExpr(
                    Identifier(
                        Ident {
                            value: "status",
                            quote_style: None,
                        },
                    ),
                ),
...
↓ Planner
LogicalPlan
- 32. “DataFrame” → Logical Plan
Rust Code
let df = ctx
.read_table("http_api_requests_total")?
.filter(col("path").eq(lit("/api/v2/write")))?
.aggregate([col("status")], [count(lit(1))])?;
DataFrame
(Builder)
LogicalPlan
- 33. Supported Logical Plan operators
Projection
Filter
Aggregate
Sort
Join
Repartition
TableScan
EmptyRelation
Limit
CreateExternalTable
Explain
Extension
- 34. Query Optimization Overview
Compute the same (correct) result, only faster
“Optimizer”:
LogicalPlan (input)
→ Optimizer Pass 1
→ LogicalPlan (intermediate)
→ Optimizer Pass 2
→ … Other Passes ...
→ LogicalPlan (output)
- 35. Built in DataFusion Optimizer Passes
ProjectionPushDown: Minimize the number of columns passed from node to node to minimize intermediate result size (number of columns)
FilterPushdown (“predicate pushdown”): Push filters as close to scans as possible to minimize intermediate result size
HashBuildProbeOrder (“join reordering”): Order joins to minimize the intermediate result size and hash table sizes
ConstantFolding: Partially evaluates expressions at plan time, e.g. ColA && true → ColA
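The ConstantFolding example (`ColA && true → ColA`) can be sketched as a recursive rewrite over a tiny expression tree. This is a toy model, assuming a hypothetical three-variant `Expr` enum; it is not DataFusion's `Expr` type or its actual pass, and it only implements the two rewrites shown in the comments.

```rust
/// Hypothetical, minimal boolean expression tree.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(bool),
    And(Box<Expr>, Box<Expr>),
}

/// Rewrites an expression bottom-up, simplifying what is known at plan time.
pub fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::And(left, right) => match (fold_constants(*left), fold_constants(*right)) {
            // Both sides are constants: evaluate now.
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a && b),
            // `x AND true` and `true AND x` are just `x`.
            (Expr::Literal(true), other) | (other, Expr::Literal(true)) => other,
            // Nothing to simplify: rebuild the node from the folded children.
            (left, right) => Expr::And(Box::new(left), Box::new(right)),
        },
        other => other,
    }
}
```

Folding `And(Column("ColA"), Literal(true))` returns `Column("ColA")`, so the comparison against `true` never has to be evaluated per row at execution time.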
- 37. Expression Evaluation
Arrow Compute Kernels typically operate on 1 or 2 arrays and/or scalars.
Partial list of included comparison kernels:
eq              Perform left == right operation on two arrays.
eq_scalar       Perform left == right operation on an array and a scalar value.
eq_utf8         Perform left == right operation on StringArray / LargeStringArray.
eq_utf8_scalar  Perform left == right operation on StringArray / LargeStringArray and a scalar.
and             Performs AND operation on two arrays. If either left or right value is null then the result is also null.
is_not_null     Returns a non-null BooleanArray with whether each value of the array is not null.
or              Performs OR operation on two arrays. If either left or right value is null then the result is also null.
...
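To make the kernel descriptions concrete, here is a minimal sketch of three of them over plain slices, using only the standard library. The names mirror the kernels listed above, but the signatures are assumptions made for illustration: the real Arrow kernels operate on Arrow arrays with validity bitmaps and return `Result`, whereas this sketch models a nullable array as a slice of `Option`.

```rust
/// Compares every string in `left` with one scalar; null inputs stay null.
pub fn eq_utf8_scalar(left: &[Option<&str>], right: &str) -> Vec<Option<bool>> {
    left.iter()
        .map(|value| value.map(|v| v == right))
        .collect()
}

/// Element-wise AND; if either side is null the result is null,
/// matching the description of the `and` kernel above.
pub fn and(left: &[Option<bool>], right: &[Option<bool>]) -> Vec<Option<bool>> {
    assert_eq!(left.len(), right.len(), "arrays must have equal length");
    left.iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(*a && *b),
            _ => None,
        })
        .collect()
}

/// Returns a non-null boolean per element: true where the value is present.
pub fn is_not_null<T>(values: &[Option<T>]) -> Vec<bool> {
    values.iter().map(Option::is_some).collect()
}
```

Each function processes a whole column in one call, which is the shape that lets the real kernels be vectorized.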
- 38. Exprs for evaluating arbitrary expressions
path = '/api/v2/write' OR path IS NULL
BinaryExpr op: Or
  left: BinaryExpr op: Eq
    left: Column path
    right: Literal ScalarValue::Utf8 '/api/v2/write'
  right: IsNull
    Column path
Expression Builder API
col("path")
.eq(lit("/api/v2/write"))
.or(col("path").is_null())
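A builder API of this shape can be sketched with a small enum and a few chaining methods. The code below is a standard-library toy that borrows the names `col`, `lit`, `eq`, `or` and `is_null` from the slide; the `Expr` and `Operator` definitions and the `Display` format are assumptions for illustration, not DataFusion's real types.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy)]
pub enum Operator {
    Eq,
    Or,
}

/// Hypothetical expression tree with just the node kinds used on the slide.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(String),
    BinaryExpr { left: Box<Expr>, op: Operator, right: Box<Expr> },
    IsNull(Box<Expr>),
}

pub fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

pub fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}

impl Expr {
    fn binary(self, op: Operator, other: Expr) -> Expr {
        Expr::BinaryExpr { left: Box::new(self), op, right: Box::new(other) }
    }

    pub fn eq(self, other: Expr) -> Expr {
        self.binary(Operator::Eq, other)
    }

    pub fn or(self, other: Expr) -> Expr {
        self.binary(Operator::Or, other)
    }

    pub fn is_null(self) -> Expr {
        Expr::IsNull(Box::new(self))
    }
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "#{}", name),
            Expr::Literal(value) => write!(f, "Utf8(\"{}\")", value),
            Expr::BinaryExpr { left, op, right } => write!(f, "{} {:?} {}", left, op, right),
            Expr::IsNull(inner) => write!(f, "{} IS NULL", inner),
        }
    }
}
```

Each builder call consumes its receiver and wraps it in a new node, so a chain of calls produces the nested `BinaryExpr` tree directly.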
- 47. Type Coercion
sqrt(col) → sqrt(CAST col as Float32)
col is Int8, but sqrt is implemented only for Float32 or Float64
⇒ Type Coercion: adds a typecast so the implementation can be called
Note: Coercion is lossless; if col was Float64, it would not be coerced to Float32
Source Code: coercion.rs
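The coercion rule can be sketched as a small lookup: keep the argument type if the function already accepts it, otherwise pick the first accepted type it can be cast to without loss. This is a toy model with a hypothetical three-member `DataType` enum and hypothetical helpers `coerce` and `plan_sqrt`; the real rules in coercion.rs cover many more types.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int8,
    Float32,
    Float64,
}

/// Casts assumed lossless in this sketch (widening only).
fn is_lossless(from: DataType, to: DataType) -> bool {
    use DataType::*;
    matches!((from, to), (Int8, Float32) | (Int8, Float64) | (Float32, Float64))
}

/// Picks the type an argument should have so that a function accepting
/// `accepted` can be called, or None if no lossless cast exists.
pub fn coerce(arg: DataType, accepted: &[DataType]) -> Option<DataType> {
    if accepted.contains(&arg) {
        return Some(arg); // Already supported: no cast needed.
    }
    accepted.iter().copied().find(|&to| is_lossless(arg, to))
}

/// Renders the planned call, inserting a CAST only when coercion requires it.
pub fn plan_sqrt(column: &str, column_type: DataType) -> Option<String> {
    let target = coerce(column_type, &[DataType::Float32, DataType::Float64])?;
    Some(if target == column_type {
        format!("sqrt({})", column)
    } else {
        format!("sqrt(CAST {} as {:?})", column, target)
    })
}
```

For an Int8 column this plans a cast to Float32, while a Float64 column is passed through unchanged rather than narrowed.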
- 49. Plan Execution Overview
Typically called the “execution engine” in database systems
DataFusion features:
● Async: Mostly avoids blocking I/O
● Vectorized: Process RecordBatch at a time, configurable batch size
● Eager Pull: Data is produced using a pull model, natural backpressure
● Partitioned: each operator produces partitions, in parallel
● Multi-Core*
* Uses async tasks; still some unease about this / if we need another thread pool
- 53. next()
SendableRecordBatchStream: Rust Stream, an async iterator that produces record batches
Operator streams, from output down to input (each pulls from the one below with next().await and receives a RecordBatch):
GroupHashAggregateStream
FilterExecStream
“ParquetStream”* for file1
Step 0: New task spawned, starts computing input immediately. Execution of GroupHash starts eagerly (before next() is called on it)
Step 1: Data read from parquet and returned
Step 2: Data is filtered
Step 3: Data is fed into a hash table
Step 4: Hash done, output produced. Ready to produce values! 😅
Step 5: Output is requested
Step 6: RecordBatch returned to caller
- 54. next()
Operator streams, from output down to input (each pulls from the one below with next().await and receives a RecordBatch):
GroupHashAggregateStream
MergeStream: eagerly starts on its own task, back pressure via bounded channels
GroupHashAggregateStream, GroupHashAggregateStream (inputs to MergeStream)
Step 0: New tasks spawned, each starts computing its input immediately
Step 1: Output is requested
Step 2: Eventually a RecordBatch is produced from downstream and returned
Step 3: Merge passes on RecordBatch
Step 4: Data is fed into a hash table
Step 5: Hash done, output produced
Step 6: RecordBatch returned to caller
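The eager start and the back pressure via bounded channels can be imitated with the standard library. This sketch swaps in OS threads and `std::sync::mpsc::sync_channel` for the async tasks and channels the slides describe, so it illustrates the pattern rather than DataFusion's implementation; `Batch`, `spawn_partition` and `merge_sum` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Stand-in for a RecordBatch: one column of numbers.
pub type Batch = Vec<u64>;

/// Starts producing a partition right away on its own thread (Step 0).
/// The bounded channel blocks the producer once `capacity` batches are
/// waiting, which is the back pressure.
pub fn spawn_partition(batches: Vec<Batch>, capacity: usize) -> Receiver<Batch> {
    let (sender, receiver) = sync_channel(capacity);
    thread::spawn(move || {
        for batch in batches {
            // Stop early if the consumer has gone away.
            if sender.send(batch).is_err() {
                break;
            }
        }
    });
    receiver
}

/// Merges all partitions and aggregates them (here: a simple sum).
pub fn merge_sum(partitions: Vec<Vec<Batch>>) -> u64 {
    // All producers are spawned before any output is requested.
    let receivers: Vec<Receiver<Batch>> = partitions
        .into_iter()
        .map(|partition| spawn_partition(partition, 1))
        .collect();

    let mut total = 0;
    for receiver in receivers {
        // Pull batches until the producer finishes and drops its sender.
        for batch in receiver {
            total += batch.iter().sum::<u64>();
        }
    }
    total
}
```

Because each channel holds at most one pending batch, a fast producer cannot run arbitrarily far ahead of the consumer, which bounds memory use in the same way the bounded channels do for MergeStream.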
- 55. Get Involved
Check out the project Apache Arrow
Join the mailing list (links on project page)
Test out Arrow (crates.io) and DataFusion (crates.io) in your projects
Help out with the docs/code/tickets on GitHub
Thank You!!!!