Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

Apache Tajo:
A Big Data Warehouse System
on Hadoop
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014

Talk Outline
• Introduction to Apache Tajo
• What you can do with Tajo
• Why you should use Tajo
• Current Status of Tajo Project
• Demonstration

About Me
• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea Univ.
• Director of Research, Gruter Corp
• Open-source Involvement
– Full-time contributor to Apache Tajo (2013.6 ~ )
– Apache Tajo PMC member and committer (2013.3 ~ )
– Apache Giraph PMC member and committer (2011. 8 ~ )
• Contact Info
– Email: hyunsik@apache.org
– LinkedIn: http://linkedin.com/in/hyunsikchoi/

Apache Tajo
• Open-source “SQL-on-Hadoop” big data warehouse (“Big DW”) system
• Apache Top-level project since March 2014
• Supports SQL standards
• Handles both low-latency queries and long-running batch queries
• Features
– Supports Joins (inner and all outer), Groupby, and Sort
– Window function
– Most SQL data types supported (except for Decimal)
• Recent 0.8.0 release
– https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0

What You Can Do with Tajo
• Batch queries
– Long-running queries (~ hours)
• Dynamic Scheduling
• Fault Tolerance
– ETL workloads
• Interactive Ad-hoc Queries
– Very low-latency (100 ms ~)
– A few seconds on a multi-terabyte dataset if your cluster has enough capacity

Why You Should Use Tajo
• SQL Standards
– Plus non-standard features from PostgreSQL and Oracle
• Simple Installation and Operation
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Simple Software Stack Requirement
– No MapReduce and No Tez
– YARN is supported but not mandatory
– Tajo + Linux system for single node cluster
– Tajo + HDFS for a distributed cluster

• Mature SQL Feature Set
– Fully distributed query executions
• Inner join, and left/right/full outer join
• Groupby, sort, multiple distinct aggregation, window function
– SQL data types
• CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
• TIMESTAMP, DATE, TIME, and INTERVAL
• DECIMAL (work in progress)
– Various file formats
• Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)

• Fully community-driven open source
• Stable development team
– 5 full-time contributors plus many other contributors
• Performance and speed
– 1.5 to 10 times faster than Hive 0.10
– Tajo vs. Hive 0.13?
– Tajo vs. Impala?

• Integration with Hadoop Ecosystem
– Hadoop 2.2.0 – 2.4.0 support
– Can connect to Hive Metastore
– Directly process tables managed by Hive
– Yarn support (backport)
• Enable Tajo to deploy and run on Yarn cluster
• Allow users to add nodes to or remove nodes from a Tajo cluster at runtime
• Contributed by Min Zhou (committer), LinkedIn engineer
• https://github.com/coderplay/tajo-yarn

Current Status – Overall
• In beta stage – most key features are nearly ready
• Most SQL features implemented
• Working on hundreds of clusters for
production
– Collaboration with the biggest telco in S. Korea
• We’ve just started work on low-level optimizations.
– Runtime byte code generation (v0.9)
– Unsafe-based hash table for hash aggregation/join
– Vectorized execution engine

Current Status – Logical Plan Optimizer
• Basic Rewrite Rule
– Common sub expression elimination
– Constant folding (CF), and Null propagation
• Projection Push Down (PPD)
– push expressions down to the lowest possible operators
– narrow the set of columns to read
– remove duplicated expressions
• when several expressions share a common subexpression
• Filter Push Down (FPD)
– reduce the number of rows to process as early as possible
• Extensible Rewrite Rule
– Allow developers to write their own rewrite rules

Original:
SELECT
  item_id,
  order_id,
  sum_price * (1.2 * 0.3) AS total
FROM (
  SELECT
    item_id,
    order_id,
    sum(price) AS sum_price
  FROM
    ITEMS
  GROUP BY item_id, order_id
) a
WHERE item_id = 17234

Rewritten:
SELECT
  item_id,
  order_id,
  sum(price) * (0.36) AS total   -- CF + PPD
FROM
  ITEMS
WHERE item_id = 17234            -- FPD
GROUP BY
  item_id,
  order_id
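
The constant folding step can be illustrated with a minimal Python sketch. It is not Tajo's implementation: the `Const`, `Column`, and `BinOp` node classes and the `fold_constants` function are hypothetical names introduced only for this example. It shows how a subexpression such as `1.2 * 0.3`, which has constants on both sides, can be evaluated once at planning time instead of once per row.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression-tree nodes, invented for this illustration.
@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, BinOp]

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: replace constant-only subtrees with one Const."""
    if isinstance(expr, BinOp):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(_OPS[expr.op](left.value, right.value))
        return BinOp(expr.op, left, right)
    # Columns and constants are already in their simplest form.
    return expr


# sum_price * (1.2 * 0.3): only the parenthesized part is constant.
expr = BinOp("*", Column("sum_price"), BinOp("*", Const(1.2), Const(0.3)))
folded = fold_constants(expr)
print(folded)
```

After folding, the right operand is a single constant, so the executor performs one multiplication per row instead of two.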

• Cost-based Join Order (since v0.2)
– No need to guess the right join order anymore
– Greedy heuristic algorithm
• Resulting in a bushy join tree instead of left-deep join tree
[Figure: left-deep join tree vs. bushy join tree]
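
A greedy heuristic of this kind can be sketched in Python as follows. This is only an illustration under simple assumptions, not Tajo's actual cost model: the `greedy_join_order` function, the row counts, and the selectivities are all hypothetical. At each step it joins the pair of subplans with the smallest estimated output, preferring pairs connected by a join predicate. Because any two subplans may be merged, the result can be a bushy tree.

```python
from itertools import combinations


def greedy_join_order(sizes, selectivity):
    """Return a nested-tuple join tree.

    sizes:       dict mapping table name -> estimated row count
    selectivity: dict mapping frozenset({a, b}) -> join predicate selectivity
    """
    # Each plan is (tree, base tables covered, estimated rows).
    plans = [(name, frozenset([name]), float(rows)) for name, rows in sizes.items()]

    def estimate(p, q):
        sel, connected = 1.0, False
        for a in p[1]:
            for b in q[1]:
                s = selectivity.get(frozenset((a, b)))
                if s is not None:
                    sel *= s
                    connected = True
        return p[2] * q[2] * sel, connected

    while len(plans) > 1:
        best = None
        for p, q in combinations(plans, 2):
            rows, connected = estimate(p, q)
            # Avoid cross products first, then prefer the smaller result.
            key = (not connected, rows)
            if best is None or key < best[0]:
                best = (key, p, q, rows)
        _, p, q, rows = best
        plans.remove(p)
        plans.remove(q)
        plans.append(((p[0], q[0]), p[1] | q[1], rows))
    return plans[0][0]


# Hypothetical statistics: a-b and c-d are highly selective joins.
sizes = {"a": 1000, "b": 10, "c": 1000, "d": 10}
selectivity = {
    frozenset(("a", "b")): 0.001,
    frozenset(("c", "d")): 0.001,
    frozenset(("b", "d")): 0.5,
}
tree = greedy_join_order(sizes, selectivity)
print(tree)
```

With these numbers the two cheap joins are built independently and then joined to each other, which a left-deep plan cannot express.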

Current Status – Window Function
• OVER clause
– row_number() and rank()
– Aggregation function support
– PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
SELECT
depname, empno, salary, enroll_date,
rank() OVER (PARTITION BY depname
ORDER BY salary DESC, empno) AS pos
FROM empsalary
) AS ss
WHERE
pos < 3;
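
To make the semantics of the query above concrete, here is a small Python sketch that emulates `rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno)` followed by the `pos < 3` filter. The `top_ranked` function and the sample rows are hypothetical, and the `enroll_date` column is left out for brevity. It describes what the query computes, not how Tajo executes it.

```python
from itertools import groupby


def top_ranked(rows, limit):
    """Keep rows whose rank within their department is below `limit`."""
    # Sorting by (partition key, order keys) puts each partition in rank order.
    ordered = sorted(rows, key=lambda r: (r["depname"], -r["salary"], r["empno"]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r["depname"]):
        prev_key, rank = None, 0
        for position, row in enumerate(group, start=1):
            key = (row["salary"], row["empno"])
            if key != prev_key:
                # rank() jumps to the row position; ties keep the earlier rank.
                rank, prev_key = position, key
            if rank < limit:
                result.append({**row, "pos": rank})
    return result


# Hypothetical sample data.
empsalary = [
    {"depname": "develop", "empno": 7, "salary": 4200},
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "develop", "empno": 11, "salary": 5200},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "sales", "empno": 4, "salary": 4800},
]
top_two = top_ranked(empsalary, limit=3)
for row in top_two:
    print(row["depname"], row["empno"], row["salary"], row["pos"])
```

Because `empno` is part of the ordering, two employees with the same salary still receive different ranks, so each department contributes exactly two rows here.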

Current Status – Join
• Join
– NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
– SEMI, ANTI Join (planned for v0.9)
• Join Predicates
– WHERE and ON predicates
– De facto standard outer join behavior with both predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
AND t2.value = 'xxx';
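
The difference between placing a predicate in ON and placing it in WHERE can be shown with a short Python emulation of a left outer join. The `left_join` helper and the sample tables are hypothetical. A predicate in ON decides which right-side rows match, and unmatched left rows are still kept with a NULL (`None`) right side. A predicate in WHERE is applied after the join and discards those NULL-extended rows.

```python
def left_join(left, right, on):
    """Left outer join: every left row appears at least once."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))  # NULL-extended row
    return out


# Hypothetical sample tables.
t1 = [{"num": 1}, {"num": 2}, {"num": 3}]
t2 = [{"num": 1, "value": "xxx"}, {"num": 3, "value": "yyy"}]

# ... ON t1.num = t2.num WHERE t2.value = 'xxx'
joined = left_join(t1, t2, lambda l, r: l["num"] == r["num"])
where_result = [(l, r) for l, r in joined if r is not None and r["value"] == "xxx"]

# ... ON t1.num = t2.num AND t2.value = 'xxx'
on_result = left_join(
    t1, t2, lambda l, r: l["num"] == r["num"] and r["value"] == "xxx"
)

print(len(where_result), len(on_result))
```

The WHERE form keeps only the single matching row, while the ON form keeps all three rows of `t1`.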

Current Status – Table Partitions
• Column Value Partition
– Hive Compatible Partition
• Range Partition (planned for 1.0)
– Table will be partitioned by disjoint ranges.
– Will remove the partition granularity problem of
Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
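
As a rough illustration of what a Hive-compatible column value partition implies, the Python sketch below groups rows into Hive-style `column=value` directory paths. It assumes the usual Hive layout. The helper names and sample rows are hypothetical, and it does not reproduce Tajo's storage code.

```python
from collections import defaultdict


def partition_path(row, partition_columns):
    """Build a Hive-style relative path such as 'c3=1/c4=kr'."""
    return "/".join(f"{col}={row[col]}" for col in partition_columns)


def group_by_partition(rows, partition_columns):
    """Map each partition path to its rows, minus the partition columns."""
    buckets = defaultdict(list)
    for row in rows:
        # Partition values live in the path, so they are dropped from the data.
        data = {k: v for k, v in row.items() if k not in partition_columns}
        buckets[partition_path(row, partition_columns)].append(data)
    return dict(buckets)


# Hypothetical rows for a table like T1 above.
rows = [
    {"c1": 10, "c2": "a", "c3": 1, "c4": "kr"},
    {"c1": 11, "c2": "b", "c3": 1, "c4": "kr"},
    {"c1": 12, "c2": "c", "c3": 2, "c4": "us"},
]
layout = group_by_partition(rows, ["c3", "c4"])
for path, data in layout.items():
    print(path, data)
```

Each distinct combination of partition values gets its own directory, which is why high-cardinality partition columns lead to the granularity problem that range partitioning is meant to address.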

Future Work
• Multi-tenant Scheduler (v0.9)
– Support multiple users and multiple queries
• Runtime byte code generation for expressions (v0.9)
– Eliminate interpretation overhead in expression evaluation
• Authentication and SQL Standard Access Control
• JIT-based Vectorized Processing Engine
– Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)

Get Involved!
• We are recruiting contributors!
• General
– http://tajo.apache.org
• Getting Started
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Downloads
– http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
• Jira – Issue Tracker
– https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
– dev-subscribe@tajo.apache.org
– issues-subscribe@tajo.apache.org
