Apache HBase at Yahoo Scale
PUSHING THE LIMITS
Francis Liu
HBase, Yahoo
HBase @ Y!
• Hosted multi-tenant clusters
• 3 Production
• 3 Sandbox
• HBase-only
• Off-stage Use Cases
• Internal 0.98 releases
• Security
[Deployment diagram: HBase clusters (HBase Master, Zookeeper quorum, Namenode, RegionServers co-located with DataNodes) are separate from compute clusters (Resource Mgr, Namenode, Node Mgr/TaskTracker with DataNodes running M/R tasks); HBase and M/R clients run from a gateway/launcher, and HTTP clients go through a REST proxy]
Workload Jungle
Multi-tenancy
Multi-tenancy at Scale
• 35 Tenants
• 800 RegionServers
• 300k regions
• RegionServer peak: 115k requests/sec
Divide and Conquer
[Diagram: the fleet of RegionServers divided into Groups A through E, each group containing its own set of RegionServers]
RegionServer Groups
• Group membership
  • Table
  • RegionServer
• Coarse isolation
• Group customization
• Namespace integration (see the sketch below)
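This work was later contributed upstream as the rsgroups feature (HBASE-6721). Below is a minimal sketch of the namespace integration, assuming the upstream configuration key hbase.rsgroup.name; the group and namespace names are hypothetical, and the group itself is assumed to have been created and populated with servers beforehand via the rsgroup admin API or shell.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RSGroupNamespaceSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // "tenant_a" is a hypothetical regionserver group assumed to already
      // exist. Every table created in this namespace is then assigned only
      // to servers belonging to that group, giving coarse per-tenant
      // isolation.
      NamespaceDescriptor ns = NamespaceDescriptor.create("tenant_a_ns")
          .addConfiguration("hbase.rsgroup.name", "tenant_a")
          .build();
      admin.createNamespace(ns);
    }
  }
}
```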
Multi-tenancy at Scale
• 800 RegionServers
• 40 namespaces
• 40 RegionServer groups
  • 4 to 100s of servers per group
• Up to 2000+ regions per server
• ~1 week rolling upgrade
Scaling to 10s of PBs (and Beyond)
• Scale to millions of regions (and beyond)
• Avoid large regions
• Data locality
  • Network utilization
  • Datanode load
  • Performance
Filesystem Layout
• Region directories sit directly under the table directory
• HDFS data structure bottleneck
• Namenode hard limit of ~6.7 million
[Chart: create-file ops for a 5M-region table]
• Hierarchical table layout (see the sketch below)
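The hierarchical layout (the "Humongous" rows in the table below) breaks up the single flat table directory by bucketing region directories under intermediate directories. The exact bucketing scheme in Yahoo's internal releases is not spelled out here; the sketch below assumes buckets keyed by a two-character prefix of the encoded region name, which is illustrative only.

```java
import org.apache.hadoop.fs.Path;

/**
 * Sketch of flat vs. hierarchical region directory layout, assuming regions
 * are bucketed by a prefix of their encoded region name. Bucket depth and
 * width are illustrative, not the exact internal scheme.
 */
public class HierarchicalRegionPath {
  /** Flat layout: /hbase/data/<ns>/<table>/<encodedRegionName> */
  static Path flatRegionDir(Path tableDir, String encodedRegionName) {
    return new Path(tableDir, encodedRegionName);
  }

  /** Hierarchical layout: /hbase/data/<ns>/<table>/<bucket>/<encodedRegionName> */
  static Path hierarchicalRegionDir(Path tableDir, String encodedRegionName) {
    // Encoded region names are MD5 hex strings, so a 2-character prefix
    // spreads millions of region directories across 256 buckets instead of
    // one directory that hits the Namenode's per-directory limit.
    String bucket = encodedRegionName.substring(0, 2);
    return new Path(new Path(tableDir, bucket), encodedRegionName);
  }

  public static void main(String[] args) {
    Path tableDir = new Path("/hbase/data/default/bigtable"); // hypothetical
    String encoded = "5ef9b2f34c7dbd2fcb7a38e8d5e1f0a3";      // hypothetical
    System.out.println(flatRegionDir(tableDir, encoded));
    System.out.println(hierarchicalRegionDir(tableDir, encoded));
  }
}
```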
Performance Comparison
Test           1M Regions        5M Regions        10M Regions
Normal Table   20 mins           4 hours 23 mins   DNF
Humongous      15 mins 48 secs   1 hour 27 mins    2 hours 53 mins

Region directory creation time
ZK Region Assignment
▪ Lock thrashing
▪ ZK bottlenecks
  › List/mutate millions of znodes (see the sketch below)
  › Notification firehose
▪ State is kept in 3 places
  › Cached in the Master
  › Zookeeper
  › Meta
[Diagram: Master, Zookeeper, Meta, and the RegionServers each track the state of every region (Region 1, Region 2) during assignment]
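To make the znode bottleneck concrete, here is a minimal sketch (not HBase code) that lists the region-in-transition znodes directly. The /hbase/region-in-transition path matches the defaults of the 0.96/0.98 line and the connect string is hypothetical.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListRegionsInTransition {
  public static void main(String[] args) throws Exception {
    // Hypothetical quorum address; watcher is a no-op.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
    try {
      // With millions of regions in transition, this single call has to
      // marshal millions of child znode names in one response, and every
      // znode change fans watch notifications out to the watchers -- the
      // "notification firehose" above.
      List<String> rit = zk.getChildren("/hbase/region-in-transition", false);
      System.out.println("regions in transition: " + rit.size());
    } finally {
      zk.close();
    }
  }
}
```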
ZKLess Region Assignment
▪ ZK is no longer involved (see the configuration sketch below)
▪ Master approves all assignments
▪ State is persisted only in Meta
▪ State is updated by the Master
[Diagram: the Master and RegionServers coordinate assignment directly; region state (Region 1, Region 2) lives only in the Meta region]
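In the upstream 0.98/1.x line the equivalent behavior is gated by a single switch (HBASE-11059). A minimal sketch, assuming that upstream key; in practice this is set in hbase-site.xml on the Master and RegionServers rather than in code, and Yahoo's internal builds may wire it differently.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkLessAssignmentFlag {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // false: the Master persists and updates region states in hbase:meta
    // instead of writing per-region assignment znodes in Zookeeper.
    conf.setBoolean("hbase.assignment.usezk", false);
    System.out.println("hbase.assignment.usezk = "
        + conf.getBoolean("hbase.assignment.usezk", true));
  }
}
```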
Performance Comparison
Test                Latency
ZK                  1 hr 16 mins
ZK w/o force-sync   11 mins
ZKLess              11 mins

Assignment time for 1 million regions
Single Meta Region
▪ Meta not splittable
▪ Large compactions
▪ Longer failover times
Splittable Meta Table
▪ Scale horizontally
  › I/O load
  › Caching
  › RPC load (see the scan sketch below)
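The "Scan Meta" column in the comparison below is a full scan of hbase:meta. A minimal client-side sketch of such a scan, which looks the same whether meta is a single region or split across several RegionServers; with millions of regions, this is exactly the I/O, cache, and RPC load that splitting meta spreads out.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanMeta {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      Scan scan = new Scan();
      scan.setCaching(1000); // fewer RPC round trips when scanning millions of rows
      long rows = 0;
      try (ResultScanner scanner = meta.getScanner(scan)) {
        for (Result r : scanner) {
          rows++;
        }
      }
      System.out.println("meta rows (~regions): " + rows);
    }
  }
}
```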
Performance Comparison
Config           Scan Meta   Assignment   Total
1 Meta / 1 RS    56 min      19.79 min    75.79 min
1 Meta / 1 RS    58.63 min   28.16 min    86.79 min
32 Meta / 3 RS   2.91 min    12.56 min    15.47 min
32 Meta / 3 RS   3.6 min     12.54 min    16.4 min

Assignment time for 3 million regions
Data Locality
▪ HDFS
  › Hadoop Distributed Filesystem
▪ RegionServer
  › Serves regions
  › Locality of a region's data blocks
Favored Nodes
▪ HDFS
  › Dictate block placement on file creation (see the sketch below)
▪ HBase
  › Partially completed in Apache HBase
  › Select 3 favored nodes per region
  › 1 node on-rack, 2 nodes off-rack
  › Restrict region assignment
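On the HDFS side, the client can suggest datanodes for block placement at file-creation time. A minimal sketch using the DistributedFileSystem create overload that accepts favored-node addresses; hostnames, ports, and the path are hypothetical, and HBase drives this through its own file-creation helpers rather than ad hoc calls like this.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("favored nodes require HDFS");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // 3 favored nodes: 1 on-rack, 2 off-rack. Rack awareness is decided by
    // the caller; the addresses are hints to the Namenode, not guarantees.
    InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
        new InetSocketAddress("dn-rack1-a.example.com", 50010),
        new InetSocketAddress("dn-rack2-a.example.com", 50010),
        new InetSocketAddress("dn-rack2-b.example.com", 50010),
    };

    Path hfile = new Path("/hbase/data/default/t1/region-x/cf/hfile-tmp");
    try (FSDataOutputStream out = dfs.create(hfile,
        FsPermission.getFileDefault(), true /* overwrite */,
        conf.getInt("io.file.buffer.size", 4096), (short) 3 /* replication */,
        dfs.getDefaultBlockSize(hfile), null /* progressable */, favoredNodes)) {
      // ... write file contents here; the Namenode has been hinted to place
      // the blocks on the favored datanodes above.
    }
  }
}
```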
Favored Nodes – Fault Testing
[Chart: fault-testing results comparing a control run against favored nodes]
THANK YOU
Icon Courtesy – iconfinder.com (under Creative Commons)