Apache Tajo:
A Big Data Warehouse System
on Hadoop
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014
Talk Outline
• Introduction to Apache Tajo
• What you can do with Tajo
• Why you should use Tajo
• Current Status of Tajo Project
• Demonstration
About Me
• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea Univ.
• Director of Research, Gruter Corp
• Open-source Involvement
– Full-time contributor to Apache Tajo (2013.6 ~ )
– Apache Tajo PMC member and committer (2013.3 ~ )
– Apache Giraph PMC member and committer (2011.8 ~ )
• Contact Info
– Email: hyunsik@apache.org
– LinkedIn: http://linkedin.com/in/hyunsikchoi/
Apache Tajo
• Open-source “SQL-on-Hadoop” “Big DW” (big data warehouse) system
• Apache Top-level project since March 2014
• Supports SQL standards
• Low-latency queries and long-running batch queries
• Features
– Supports Joins (inner and all outer), Groupby, and Sort
– Window function
– Most SQL data types supported (except for Decimal)
• Recent 0.8.0 release
– https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
Overall Architecture
What You Can Do with Tajo
• Batch queries
– Long-running queries (~ hours)
• Dynamic Scheduling
• Fault Tolerance
– ETL workloads
• Interactive Ad-hoc Queries
– Very low latency (from 100 ms)
– A few seconds on a dataset of several TB if your cluster has enough capacity
Why You Should Use Tajo
• SQL Standards
– Plus non-standard features from PostgreSQL and Oracle
• Simple Installation and Operation
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Simple Software Stack Requirement
– No MapReduce and No Tez
– YARN is supported but not mandatory
– Tajo + Linux system for a single-node cluster
– Tajo + HDFS for a distributed cluster
Why You Should Use Tajo
• Mature SQL Feature Set
– Fully distributed query execution
• Inner join, and left/right/full outer join
• Groupby, sort, multiple distinct aggregation, window function
– SQL data types
• CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
• TIMESTAMP, DATE, TIME, and INTERVAL
• DECIMAL (work in progress)
– Various file formats
• Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
Why You Should Use Tajo
• Fully community-driven open source
• Stable development team
– 5 full-time contributors + many other contributors
• Performance and speed
– 1.5 to 10 times faster than Hive 0.10
– Tajo vs. Hive 0.13?
– Tajo vs. Impala?
Why You Should Use Tajo
• Integration with Hadoop Ecosystem
– Hadoop 2.2.0 – 2.4.0 support
– Can connect to Hive Metastore
– Directly process tables managed by Hive
– Yarn support (backport)
• Enables Tajo to be deployed and run on a YARN cluster
• Allows users to add nodes to and remove nodes from a Tajo cluster at runtime
• Contributed by Min Zhou (committer), LinkedIn engineer
• https://github.com/coderplay/tajo-yarn
Current Status – Overall
• In beta stage: most key features are nearly ready
• Most SQL features implemented
• Working on hundreds of clusters for production
– Collaboration with the biggest telco in S. Korea
• We’ve just started work on low-level optimization
– Runtime byte code generation (v0.9)
– Unsafe-based hash table for hash aggregation/join
– Vectorized execution engine
Current Status – Logical Plan Optimizer
• Basic Rewrite Rule
– Common sub expression elimination
– Constant folding (CF), and Null propagation
• Projection Push Down (PPD)
– Push expressions down to the lowest possible operators
– Narrow the set of columns that are read
– Remove duplicated expressions
• when several expressions share a common sub-expression
• Filter Push Down (FPD)
– Reduce the rows to be processed as early as possible
• Extensible Rewrite Rule
– Allow developers to write their own rewrite rules
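The constant folding rule above can be sketched as a bottom-up rewrite over a tiny expression tree. This is a minimal illustration in Python, not Tajo's optimizer code: the `Const`, `Column`, and `Mul` node types and the `fold_constants` function are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, Mul]


def fold_constants(expr: Expr) -> Expr:
    """Bottom-up rewrite: replace Mul(Const, Const) with a single Const."""
    if isinstance(expr, Mul):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            # Both operands are known at planning time, so multiply once here
            # instead of once per row at execution time.
            return Const(left.value * right.value)
        return Mul(left, right)
    # Constants and column references are already in their simplest form.
    return expr


# sum_price * (1.2 * 0.3): only the parenthesized part is foldable.
original = Mul(Column("sum_price"), Mul(Const(1.2), Const(0.3)))
folded = fold_constants(original)
```

After folding, the right operand is a single constant (about 0.36), while the column reference is left untouched because its value is only known at run time.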
Current Status – Logical Plan Optimizer
Original
SELECT
  item_id,
  order_id,
  sum_price * (1.2 * 0.3) AS total
FROM (
  SELECT
    item_id,
    order_id,
    sum(price) AS sum_price
  FROM
    ITEMS
  GROUP BY item_id, order_id
) a
WHERE item_id = 17234
Rewritten
SELECT
  item_id,
  order_id,
  sum(price) * 0.36 AS total  -- CF + PPD
FROM
  ITEMS
WHERE item_id = 17234         -- FPD
GROUP BY
  item_id,
  order_id
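Filter push down is safe in the example above because the filter column (`item_id`) is also a grouping key, so filtering before or after the aggregation yields the same groups. A minimal Python sketch of that equivalence follows. The sample rows and the function names are hypothetical, and plain Python lists stand in for Tajo's operators.

```python
from collections import defaultdict

# Hypothetical rows of ITEMS: (item_id, order_id, price)
ITEMS = [
    (17234, 1, 10.0),
    (17234, 1, 5.0),
    (17234, 2, 7.5),
    (99999, 1, 3.0),
    (99999, 3, 8.0),
]

FACTOR = 1.2 * 0.3  # the constant sub-expression from the query


def aggregate(rows):
    """GROUP BY item_id, order_id with sum(price)."""
    sums = defaultdict(float)
    for item_id, order_id, price in rows:
        sums[(item_id, order_id)] += price
    return dict(sums)


def original_plan(rows, wanted_item):
    # Aggregate every group first, then filter the aggregated rows.
    grouped = aggregate(rows)
    return {k: v * FACTOR for k, v in grouped.items() if k[0] == wanted_item}


def rewritten_plan(rows, wanted_item):
    # Filter first, so the aggregation only sees the matching rows.
    filtered = [r for r in rows if r[0] == wanted_item]
    return {k: v * FACTOR for k, v in aggregate(filtered).items()}


same = original_plan(ITEMS, 17234) == rewritten_plan(ITEMS, 17234)
```

Both plans return the same two groups, but the rewritten plan aggregates three rows instead of five. On a real table the saving grows with the number of non-matching rows.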
Current Status – Logical Plan Optimizer
• Cost-based Join Order (since v0.2)
– No need to guess the right join order anymore
– Greedy heuristic algorithm
• Resulting in a bushy join tree instead of a left-deep join tree
[Figure: left-deep join tree vs. bushy join tree]
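To show how a greedy heuristic can end up with a bushy tree, here is a generic sketch in Python. It is not Tajo's actual algorithm or cost model: the cost estimate (input sizes multiplied by predicate selectivities), the relation names, and all numbers are assumptions made up for illustration.

```python
from itertools import combinations


def greedy_join_order(cardinalities, selectivities):
    """Return (join_tree, estimated_rows) built by greedy pairwise merging.

    cardinalities: {relation_name: row_count}
    selectivities: {frozenset({a, b}): selectivity of the predicate on a and b}
    """
    # Each plan is (tree, set_of_base_relations, estimated_rows).
    plans = [(name, frozenset([name]), rows) for name, rows in cardinalities.items()]

    def estimate(left, right):
        # Product of the input sizes times every applicable selectivity.
        rows = left[2] * right[2]
        connected = False
        for pair, sel in selectivities.items():
            a, b = tuple(pair)
            if (a in left[1] and b in right[1]) or (b in left[1] and a in right[1]):
                rows *= sel
                connected = True
        return rows, connected

    while len(plans) > 1:
        best = None
        for left, right in combinations(plans, 2):
            rows, connected = estimate(left, right)
            # Prefer pairs linked by a predicate, then the smallest result.
            key = (not connected, rows)
            if best is None or key < best[0]:
                best = (key, left, right, rows)
        _, left, right, rows = best
        plans.remove(left)
        plans.remove(right)
        # Any two sub-plans may be merged, so the tree can become bushy.
        plans.append(((left[0], right[0]), left[1] | right[1], rows))

    tree, _, rows = plans[0]
    return tree, rows


CARDINALITIES = {"A": 1000, "B": 2000, "C": 3000, "D": 4000}
SELECTIVITIES = {
    frozenset({"A", "B"}): 0.001,
    frozenset({"C", "D"}): 0.0005,
    frozenset({"B", "C"}): 0.01,
}
tree, estimated_rows = greedy_join_order(CARDINALITIES, SELECTIVITIES)
```

With these numbers the sketch joins A with B, then C with D, and only then joins the two intermediate results, giving the bushy tree `(("A", "B"), ("C", "D"))`. A left-deep planner would instead have to extend a single chain one base table at a time.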
Current Status – Window Function
• OVER clause
– row_number() and rank()
– Aggregation function support
– PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
SELECT
depname, empno, salary, enroll_date,
rank() OVER (PARTITION BY depname
ORDER BY salary DESC, empno) AS pos
FROM empsalary
) AS ss
WHERE
pos < 3;
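For readers less familiar with window functions, the query above can be emulated in a few lines of Python. This only illustrates what `rank() OVER (PARTITION BY ... ORDER BY ...)` computes. The sample rows are hypothetical, and `enroll_date` is left out to keep the data short.

```python
from itertools import groupby

# Hypothetical rows of empsalary: (depname, empno, salary)
EMPSALARY = [
    ("develop", 7, 4200),
    ("develop", 8, 6000),
    ("develop", 9, 4500),
    ("develop", 10, 5200),
    ("develop", 11, 5200),
    ("sales", 1, 5000),
    ("sales", 3, 4800),
    ("sales", 4, 4800),
]


def rank_by_department(rows):
    """Emulate rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno)."""
    # Sort by partition key first, then by the window's ORDER BY keys.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2], r[1]))
    ranked = []
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        previous_key, rank = None, 0
        for position, row in enumerate(partition, start=1):
            sort_key = (row[2], row[1])
            # rank() repeats for ties on the full ORDER BY key and otherwise
            # jumps to the row's position within the partition.
            if sort_key != previous_key:
                rank = position
                previous_key = sort_key
            ranked.append(row + (rank,))
    return ranked


# The outer query's filter: WHERE pos < 3
top_two = [r for r in rank_by_department(EMPSALARY) if r[3] < 3]
```

Because `empno` is part of the ORDER BY and is unique in this data, no two rows tie, so `rank()` behaves like `row_number()` here and each department contributes exactly two rows.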
Current Status – Join
• Join
– NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
– SEMI, ANTI Join (planned for v0.9)
• Join Predicates
– WHERE and ON predicates
– De facto standard outer join behavior with both predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';
SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num
AND t2.value = 'xxx';
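Where a predicate sits matters for outer joins. The sketch below emulates a LEFT JOIN in Python to show the usual SQL behavior: a WHERE filter on the right-hand table runs after the join and drops the null-padded rows, while the same condition inside ON only restricts which rows match. The tables and the `left_join` helper are hypothetical, and the "both conditions in ON" variant is added here for contrast rather than taken from the slide.

```python
# Hypothetical tables: t1(num) and t2(num, value)
T1 = [1, 2, 3]
T2 = [(1, "xxx"), (2, "yyy")]


def left_join(left, right, on):
    """LEFT JOIN: keep every left row; pad with None when nothing matches."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r[0], r[1]) for r in matches)
        else:
            out.append((l, None, None))
    return out


# ... ON t1.num = t2.num WHERE t2.value = 'xxx'
on_then_where = [
    row
    for row in left_join(T1, T2, lambda l, r: l == r[0])
    if row[2] == "xxx"
]

# ... ON t1.num = t2.num AND t2.value = 'xxx'
both_in_on = left_join(T1, T2, lambda l, r: l == r[0] and r[1] == "xxx")
```

The first form returns a single row, because the rows for `num` 2 and 3 fail the WHERE filter. The second form keeps all three `t1` rows and pads the two that have no qualifying match.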
Current Status – Table Partitions
• Column Value Partition
– Hive Compatible Partition
• Range Partition (planned for 1.0)
– Table will be partitioned by disjoint ranges.
– Will remove the partition granularity problem of Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
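Hive-style column value partitioning stores each distinct combination of partition values in its own `column=value` directory, which lets a query skip whole directories. The Python sketch below models that layout with plain dictionaries. The table directory `t1`, the sample rows, and the helper names are assumptions for illustration, and no real file I/O or Tajo API is involved.

```python
from collections import defaultdict


def partition_path(table_dir, partition_columns, row):
    """Build a Hive-style directory such as t1/c3=10/c4=a for one row (a dict)."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return "/".join([table_dir] + parts)


def write_partitioned(table_dir, partition_columns, rows):
    """Group rows by target directory; partition values live only in the path."""
    files = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_columns}
        files[partition_path(table_dir, partition_columns, row)].append(data)
    return dict(files)


def prune(paths, column, value):
    """Keep only the directories whose path contains column=value."""
    return [p for p in paths if f"{column}={value}" in p.split("/")]


ROWS = [
    {"c1": 1, "c2": "x", "c3": 10, "c4": "a"},
    {"c1": 2, "c2": "y", "c3": 10, "c4": "b"},
    {"c1": 3, "c2": "z", "c3": 20, "c4": "a"},
]
layout = write_partitioned("t1", ["c3", "c4"], ROWS)
to_scan = prune(sorted(layout), "c3", 10)
```

A filter such as `c3 = 10` then only needs to read `t1/c3=10/c4=a` and `t1/c3=10/c4=b`. The granularity problem mentioned above is visible here too: every distinct value creates a directory, which is what range partitioning is meant to avoid.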
Future Work
• Multi-tenant Scheduler (v0.9)
– Support multiple users and multiple queries
• Runtime byte code generation for expressions (v0.9)
– Eliminate the interpretation overhead of expression evaluation
• Authentication and SQL Standard Access Control
• JIT-based Vectorized Processing Engine
– Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
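The idea behind runtime code generation for expressions can be sketched in Python: instead of walking the expression tree for every row, compile it once into a function and call that per row. This is only an analogy, since Tajo runs on the JVM and the slides give no implementation details. The tuple-based expression format and all function names below are invented for this sketch.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "<": operator.lt}


def interpret(expr, row):
    """Tree-walking evaluation: dispatches on the node type again for every row."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "col":
        return row[expr[1]]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Turn the expression tree into Python source text."""
    kind = expr[0]
    if kind == "const":
        return repr(expr[1])
    if kind == "col":
        return f"row[{expr[1]!r}]"
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr):
    """Compile the tree once; the result is a plain function of a row."""
    code = compile(f"lambda row: {to_source(expr)}", "<generated>", "eval")
    return eval(code)


# price * 2 + 1 < 100
EXPR = ("<", ("+", ("*", ("col", "price"), ("const", 2)), ("const", 1)), ("const", 100))
compiled = generate(EXPR)
results = [compiled({"price": p}) for p in (10, 60)]
```

Both evaluators give the same answers, but the generated function has no per-node dispatch left at run time, which is the overhead the roadmap item aims to remove.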
Get Involved!
• We are recruiting contributors!
• General
– http://tajo.apache.org
• Getting Started
– http://tajo.apache.org/docs/0.8.0/getting_started.html
• Downloads
– http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
• Jira – Issue Tracker
– https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
– dev-subscribe@tajo.apache.org
– issues-subscribe@tajo.apache.org
Demonstration
