Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or
distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Pulsar Storage on BookKeeper
Seamless Evolution
June 17, 2020
Joe Francis joef@verizonmedia.com
Rajan Dhabalia rdhabalia@verizonmedia.com
Speakers
Joe Francis
Director, Verizon Media
Rajan Dhabalia
Principal Software Engineer, Verizon Media
Agenda
● Pulsar in Verizon Media
● Benchmarking for production use
● Pulsar IO Isolation
● BookKeeper with different storage devices
● Case-study: Kafka use case on Pulsar
● Future
Verizon Media & Pulsar
● Developed as a hosted pub-sub service within Yahoo/VMG
○ open-sourced in 2016
● Global deployment
○ 6 DC (Asia, Europe, US)
○ full mesh replication
● Mission critical use cases
○ Serving applications
○ Lower latency bus for use by other low latency services
○ Write availability
Benchmarking for production
● Most benchmark numbers do not test production scenarios
○ Messaging systems work well when
■ data fits in memory
■ no disk I/O in critical path (write or read)
● Pulsar was designed to work well under real-world workloads
○ Lagging consumers, replay
■ Backlog reads from disk will occur
○ Disks and brokers crash/fail
■ Pulsar ack guarantee: data is synced to disk on 2+ hosts
○ Latencies remain unaffected by load variations
■ backlog reads (I/O isolation)
■ failures (instantaneous recovery)
● Cost matters
○ Compute ($) vs Storage ($$)
● Benchmark for production use!
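The ack guarantee on this slide (data synced to disk on 2+ hosts) can be read as a quorum rule. The sketch below is only an illustration, not Pulsar or BookKeeper code: the `Bookie` class, the `publish` function, and the choice of three bookies with an ack quorum of two are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Bookie:
    """Hypothetical stand-in for a storage node; not the BookKeeper API."""
    name: str
    healthy: bool = True
    journal: list = field(default_factory=list)

    def append_and_sync(self, entry: bytes) -> bool:
        # A real bookie would append to its journal and fsync before acking.
        if not self.healthy:
            return False
        self.journal.append(entry)
        return True


def publish(entry: bytes, bookies: list, ack_quorum: int = 2) -> bool:
    """Ack the producer only once `ack_quorum` bookies have persisted the entry."""
    synced = sum(1 for b in bookies if b.append_and_sync(entry))
    return synced >= ack_quorum


# One of three bookies is down; the publish is still acknowledged.
bookies = [Bookie("bk1"), Bookie("bk2"), Bookie("bk3", healthy=False)]
acked = publish(b"msg-1", bookies)
```

With only one healthy bookie the same call would return `False`, which is the case the 2+ host guarantee is meant to rule out.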
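Data paths
[Figure: data paths. An application producer publishes to the broker (cache: RAM). The broker writes to bookies, each with RAM, a journal device, and a data device. Acks flow back from the bookies to the broker and from the broker to the producer. Application consumers read through the broker; cold reads are served from the bookies.]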
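The two read paths in the data-path diagram (broker cache versus cold reads from a bookie) can be sketched roughly as follows. This is a toy model under stated assumptions: `BrokerReadPath`, its fields, and the oldest-first eviction policy are hypothetical and do not mirror the broker's actual cache implementation.

```python
class BrokerReadPath:
    """Hypothetical model of the read path; names are illustrative only."""

    def __init__(self, cache_capacity: int):
        self.cache_capacity = cache_capacity
        self.cache = {}          # entry_id -> payload, most recent entries only
        self.bookie_store = {}   # stands in for the bookie data device
        self.cold_reads = 0

    def publish(self, entry_id: int, payload: bytes) -> None:
        self.bookie_store[entry_id] = payload
        self.cache[entry_id] = payload
        if len(self.cache) > self.cache_capacity:
            # Evict the oldest entry; dicts preserve insertion order.
            self.cache.pop(next(iter(self.cache)))

    def read(self, entry_id: int) -> bytes:
        if entry_id in self.cache:
            return self.cache[entry_id]   # tailing consumer: served from RAM
        self.cold_reads += 1              # lagging consumer: goes to the bookie
        return self.bookie_store[entry_id]


path = BrokerReadPath(cache_capacity=2)
for i in range(5):
    path.publish(i, f"msg-{i}".encode())

tail = path.read(4)     # still in the broker cache
backlog = path.read(0)  # evicted from cache, counted as a cold read
```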
BookKeeper IO Isolation
Pulsar Journey
First Generation Storage - HDD
- HDD
  - Fast, low-latency sequential writes on HDD with a battery-backed RAID controller
  - Random seek time is much longer on HDD
  - Economical
- Journal Device
  - Fast sequential writes
- Ledger Device
  - Sequential writes to a single entry-log data file for multiple streams
  - Most IOPS are used for
    - Backlog draining (cold reads)
    - Reads and writes on index files
- JOURNAL-Device HDD with RAID10
- DATA-Device HDD with RAID10
- Index: Interleaved index files
Optimizing random IOs for Indexing
- Index on interleaved file
  - One index file for each topic
  - Random IO while updating index
  - Scaling the number of topics increases random IOs and file handles
- Index on RocksDB
  - LSM-based embedded key-value store
  - Used as a library within the bookie process; no additional operational effort
  - Less write amplification and better compression
  - Drastically reduces random IOPS for indexing
  - Small footprint (< 10 GB); mostly in RAM
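The idea of one shared entry log with a separate key-value index can be sketched as below. This is a simplified illustration, not bookie code: `EntryLog` is a hypothetical class, a Python `dict` stands in for RocksDB, and the key layout `(ledger_id, entry_id) -> (offset, length)` is an assumption chosen for the example.

```python
class EntryLog:
    """Hypothetical single append-only log shared by many ledgers (streams)."""

    def __init__(self):
        self.data = bytearray()
        # Stand-in for the RocksDB index: (ledger_id, entry_id) -> (offset, length).
        self.index = {}

    def append(self, ledger_id: int, entry_id: int, payload: bytes) -> int:
        offset = len(self.data)
        self.data += payload  # sequential write, entries from all ledgers interleaved
        self.index[(ledger_id, entry_id)] = (offset, len(payload))
        return offset

    def read(self, ledger_id: int, entry_id: int) -> bytes:
        # One index lookup, then one positioned read in the shared log.
        offset, length = self.index[(ledger_id, entry_id)]
        return bytes(self.data[offset:offset + length])


log = EntryLog()
log.append(1, 0, b"a1")
log.append(2, 0, b"bb2")
log.append(1, 1, b"c")
entry = log.read(2, 0)
```

Because every ledger shares one index structure, adding topics grows the number of keys rather than the number of index files and file handles.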
Second Generation: SSD/NVMe
SSD/NVMe
- SSD provides better performance for sequential and random I/O
- NVMe supports large command queue (64K) with parallel IO
Journal Device
- Bookie can use multiple journal directories to utilize parallel write on NVMe
- Achieve 3x Pulsar throughput with low latency, compared to HDD
Ledger Device
- Significantly faster random reads than HDD
- Faster backlog draining while doing cold reads for multiple topics
- JOURNAL-Device NVMe/SSD
- DATA-Device NVMe/SSD
- Index: RocksDB
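Spreading writes over several journal directories can be pictured with a simple mapping from ledger to journal. The sketch assumes ledgers are assigned by ledger ID modulo the number of journal directories; that rule, the directory paths, and the `journal_for` helper are assumptions made for illustration, not a statement of how the bookie is implemented.

```python
from collections import Counter

# Hypothetical journal directories on one NVMe device.
JOURNAL_DIRS = [
    "/mnt/nvme/journal-0",
    "/mnt/nvme/journal-1",
    "/mnt/nvme/journal-2",
]


def journal_for(ledger_id: int, journal_dirs=JOURNAL_DIRS) -> str:
    """Pick a journal directory for a ledger (assumed modulo assignment)."""
    return journal_dirs[ledger_id % len(journal_dirs)]


# Twelve ledgers spread evenly, so three journals can be written in parallel.
load = Counter(journal_for(ledger_id) for ledger_id in range(12))
```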
Storage Device: Sequential Vs Random IO
Storage Device: Performance Vs Cost
Storage Evolution & Pulsar Adaptation: PMEM
PMEM
● Highest performing block storage device
● Ultra fast, super high throughput with consistent low latency
● Expensive; well suited as small device for WRITE intensive use cases
Journal Device
● WAL/journal is proven design in Databases
○ transactional storage and recovery
○ high throughput
● Write optimized append only structure
● Does not require much storage and keeps short lived transactional data
● Using PMEM for journal device
○ adds < 5% cost for each bookie
○ Increases Pulsar throughput 5x, with low publish latency
Pulsar Performance with Different BK-Journal Device
Performance configuration
● Enabled fsync on every published message
● Publish throughput with backlog draining
● SLA: 5 ms (99th-percentile latency)
○ HDD: 120MB
○ SSD: 200MB
○ NVMe: 350MB
○ PMEM: 600MB
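As a quick check, the figures above can be turned into speedups relative to HDD. The numbers are taken from the list as given; the variable names are illustrative.

```python
# Publish throughput per journal device, as listed on the slide.
throughput_mb = {"HDD": 120, "SSD": 200, "NVMe": 350, "PMEM": 600}

baseline = throughput_mb["HDD"]
speedup = {device: round(value / baseline, 2) for device, value in throughput_mb.items()}

for device, factor in speedup.items():
    print(f"{device}: {factor}x HDD")
```

The ratios come out at roughly 2.9x for NVMe and 5x for PMEM, in line with the 3x and 5x claims made for those journal devices.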
Case-study: Migrate Kafka Use Case to Pulsar
● Cost and Throughput
■ Using PMEM for the journal adds < 5% more cost per host but reduces overall cost and cluster footprint
■ Achieves 5x more throughput with 99th-percentile write latency under 5 ms
● Cluster footprint
■ Kafka cluster: 33 Kafka brokers
■ Pulsar cluster: 10 bookies and 16 brokers
● A Pulsar broker is a stateless component and costs 1/4 as much as a bookie
■ Overall, the Pulsar cluster uses ½ the resources of the Kafka cluster
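The "½ of the Kafka cluster" figure can be sanity-checked with the node counts on the slide. The calculation below assumes that one Kafka broker costs the same as one bookie; the slide does not state this, so treat the result as indicative only.

```python
from fractions import Fraction

KAFKA_BROKERS = 33
PULSAR_BOOKIES = 10
PULSAR_BROKERS = 16
BROKER_COST = Fraction(1, 4)     # Pulsar broker cost relative to one bookie (from the slide)
KAFKA_BROKER_COST = Fraction(1)  # ASSUMPTION: a Kafka broker costs the same as a bookie

# Express both clusters in "bookie-equivalent" cost units.
pulsar_cost = PULSAR_BOOKIES + PULSAR_BROKERS * BROKER_COST
kafka_cost = KAFKA_BROKERS * KAFKA_BROKER_COST
ratio = float(pulsar_cost / kafka_cost)

print(f"Pulsar: {pulsar_cost} units, Kafka: {kafka_cost} units, ratio: {ratio:.2f}")
```

Under that assumption the Pulsar cluster comes to 14 bookie-equivalents against 33 for Kafka, a little under half.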
Case-study: Migrate Kafka Use Case to Pulsar
[Table: Apache Pulsar vs. Apache Kafka, compared by use case: throughput with low latency, cost, geo-replication, queuing, committing messages. The per-cell marks were graphical and did not survive in this text.]
Future
● Use PMDK API to access persistent memory
○ bypass the file system
○ better throughput
● Tiered Storage for historical data use cases
○ relaxed latency requirements
○ cheaper cost
○ Use cases
■ ML model training
■ audit, forensics
Thank you
Joe Francis joef@verizonmedia.com
Rajan Dhabalia rdhabalia@verizonmedia.com
