Building a Messaging Solutions for OVHcloud
with Apache Pulsar
Pierre Zemb
Technical Leader
Pulsar Summit 2020
$ whoami
● Pierre Zemb (@PierreZ)
● Technical Leader
● Working around distributed systems
● Apache contributor
○ HBase, Flink, Pulsar
● Involved in local dev communities
Schedule
1. What is OVHcloud?
2. The need for a messaging solution
3. The choice of Apache Pulsar
4. Overview of our infrastructure
5. Overview of our management layer
6. The quest to support Apache Kafka
7. Our ideas for the future
OVHcloud, a Global Cloud Provider
● 30 data centers globally
● Our own high-quality global network, committed to the highest security standards
● NSX and vRack: secure your platform with micro-segmentation of private L2 that spans global data centers
● SSL Gateway Service: up to 10,000 concurrent connections. Optional Anycast DNS service.
● Highest compliance and certification standards
● Anti-DDoS: highly resilient Layer 4-7 DDoS protection built into the network
Providing a platform
Compute messaging?
Let’s build a messaging solution! Been there...
● OVHcloud started a beta called “Queue as a service” in 2015
● Based on Apache Kafka
● Multi-tenant cluster
● Beta closed in 2018
● Massively used internally
What we learned from Queue as a Service
From users:
● Users want not only Kafka, but queuing as well:
○ RabbitMQ
○ MQTT
○ ...
● They want support for old versions of Kafka’s protocol
● Data encryption?
What we learned from Apache Kafka
From us:
● No built-in {multi-tenancy, geo-replication}
● Creating a topic is not cost-free
● infinite retention isn't possible
● no tiered storage
● Operations are not very convenient
○ we cannot "just" scale storage
○ a consumer reading old data can slow down the whole broker
What we learned from Apache Kafka
Disclaimer:
● We ♥ Apache Kafka
● We have far more messages in Kafka than in Pulsar within OVHcloud
● For certain use cases, we need an alternative
Let’s build a messaging solution!
● Messaging system (Pulsar, Kafka, RabbitMQ, ...): what we choose as an infrastructure provider
● Messaging solution: what we are exposing to customers
Let’s build a messaging solution!
Requirements for the foundation of a messaging solution:
● has multi-tenancy
● can be used for queuing and streaming
● can be easily extended
● has lower operational cost at scale
Apache Pulsar’s TL;DR
❏ What Pulsar Provides
✓ Multi-Tenancy
✓ Security
✓ TLS Encryption
✓ Authentication, Authorization
✓ Geo-replication
✓ Queuing and streaming semantics
✓ Tiered storage
✓ Schema
✓ Integrations with big data ecosystem (Flink / Spark / Presto)
Let's deploy Apache Pulsar!
🚀
Our deployment
TODO drawing: remove producer
Add pulsar-proxy above pulsar-broker
Add haproxy above pulsar-proxy
Bookkeeper's tuning
● Enabled:
○ Z Garbage Collector, also known as ZGC
○ Prometheus exporter
● Configured:
○ multiple journal directories (journalDirectories) to better exploit SSD throughput
○ one ledger directory (ledgerDirectories) per HDD
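The directory layout above could be written into a bookie configuration roughly as follows. This is only a sketch: `journalDirectories` and `ledgerDirectories` are BookKeeper's comma-separated directory settings, while the mount points and the helper `render_bookie_dirs` are hypothetical names chosen for illustration.

```python
def render_bookie_dirs(journal_dirs, ledger_dirs):
    """Render the two directory settings of a bookie config as key=value lines."""
    if not journal_dirs or not ledger_dirs:
        raise ValueError("at least one journal and one ledger directory are required")
    return [
        "journalDirectories=" + ",".join(journal_dirs),
        "ledgerDirectories=" + ",".join(ledger_dirs),
    ]


# Hypothetical mount points: two SSDs for journals, four HDDs for ledgers.
lines = render_bookie_dirs(
    ["/mnt/ssd0/journal", "/mnt/ssd1/journal"],
    ["/mnt/hdd0/ledgers", "/mnt/hdd1/ledgers", "/mnt/hdd2/ledgers", "/mnt/hdd3/ledgers"],
)
print("\n".join(lines))
```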
Pulsar's configuration
● Started with
○ 3 bookies to use when creating a ledger (ensemble)
○ 3 copies to store for each message (writeQuorum)
○ 2 guaranteed copies (ackQuorum)
● Now running a 4/2/2 layout
○ Increases striped writes
Lesson learned:
avoid an ensemble size equal to the number of bookies
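To see why a 4/2/2 layout stripes writes while 3/3/2 does not, here is a small sketch. It assumes BookKeeper's round-robin placement, where entry `e` is written to `writeQuorum` consecutive bookies of the ensemble starting at index `e mod ensemble size`. The bookie names and helper functions are hypothetical.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry under round-robin striping."""
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(write_quorum)]


def entries_per_bookie(ensemble, write_quorum, n_entries):
    """Count how many of the first n_entries land on each bookie."""
    counts = {bookie: 0 for bookie in ensemble}
    for entry_id in range(n_entries):
        for bookie in write_set(entry_id, ensemble, write_quorum):
            counts[bookie] += 1
    return counts


# 3/3/2: every bookie stores every entry, so there is no striping.
print(entries_per_bookie(["bk1", "bk2", "bk3"], 3, 1200))
# 4/2/2: each entry goes to 2 of 4 bookies, so each bookie stores half of them.
print(entries_per_bookie(["bk1", "bk2", "bk3", "bk4"], 2, 1200))
```

With an ensemble as large as the whole cluster there is also no spare bookie to swap in when one fails, which is one plausible reading of the lesson above.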
Some benchmark!
Sending a small string as the value, as fast as we can, from 8 VMs to two partitions
1.8 million msg/s/partition
Bookkeeper outage
Lesson learned: learn Bookkeeper's CLI
Meet Bookkeeper's friend: the Auditor
Let's manage Apache Pulsar!
🚀
Our management layer
Management
µservice
● create topic
● create tokens
● set retention
● ...
Sync
Our management layer
● Written in Go
● Cluster-aware
● Push topic configuration to clusters
● Pull topic usage from clusters
● Generate valid JWT tokens
Lesson learned:
Pulling topic usage is costly; clusters should report it to the management layer instead (PIP?)
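As an illustration of the token generation mentioned above, here is a minimal standard-library sketch of an HS256-signed JWT. It is not the Go code of the management layer. It assumes a shared secret key and a role carried in the `sub` claim, which is what Pulsar's token authentication reads by default as far as we know. The secret and the subject are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(secret: bytes, subject: str) -> str:
    """Build an HS256-signed JWT whose `sub` claim carries the subject."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
    payload = b64url(json.dumps({"sub": subject}, separators=(",", ":")).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def verify_token(secret: bytes, token: str) -> dict:
    """Check the signature and return the decoded claims."""
    header, payload, signature = token.split(".")
    expected = b64url(hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest())
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical secret and subject, for illustration only.
token = make_token(b"hypothetical-shared-secret", "tenant-a")
print(verify_token(b"hypothetical-shared-secret", token))
```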
How we handled tenancy
Geo-replication, tiered storage, retention and other policies are set at the namespace level
We ended up mapping topics to namespaces, which results in one namespace per topic
We closed the admin API to our users to enforce this behavior
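The one-namespace-per-topic mapping can be pictured with the sketch below. Pulsar topic names follow the `persistent://tenant/namespace/topic` scheme. The choice of reusing the topic name as the namespace is an assumption made for illustration, not necessarily the naming OVHcloud uses, and `pulsar_topic_name` is a hypothetical helper.

```python
def pulsar_topic_name(tenant: str, topic: str) -> str:
    """Map a user-facing topic to a Pulsar topic living in its own namespace."""
    for part in (tenant, topic):
        if not part or "/" in part:
            raise ValueError(f"invalid name component: {part!r}")
    # Assumption: the dedicated namespace simply reuses the topic name.
    namespace = topic
    return f"persistent://{tenant}/{namespace}/{topic}"


print(pulsar_topic_name("customer-42", "orders"))
```

Because each topic owns its namespace, namespace-level policies such as retention effectively become per-topic settings.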
How we will improve tenancy
https://github.com/apache/pulsar/wiki/PIP-39:-Namespace-Change-Events
Lesson learned:
closing the admin API is costly, as we need to rewrite every call only to forward it
PIP-39 + cluster usage report = reopening the admin API
Opening ioStream beta!
Now we have our messaging system
Let's start Kafka-proxy!
Kafka-proxy, OVHcloud version
We first implemented KoP as a proxy PoC in Rust:
● Rust async was available in the nightly compiler when we started
● We wanted no GC on proxy layers
● Rust has awesome libraries at the TCP level
Our goal was to convert TCP frames from Kafka to Pulsar
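To give an idea of what working on Kafka TCP frames involves, here is a small sketch that decodes the size prefix and request header of a Kafka request. It is written in Python for brevity and is not the Rust proxy itself. It covers only the classic, non-flexible header layout (api key, api version, correlation id, nullable client id). `build_request` is a hypothetical helper used to produce a test frame.

```python
import struct


def parse_request_header(frame: bytes) -> dict:
    """Decode the size prefix and request header (v1) of a Kafka request frame."""
    (size,) = struct.unpack_from(">i", frame, 0)
    api_key, api_version, correlation_id, client_id_len = struct.unpack_from(">hhih", frame, 4)
    offset = 4 + 10  # size prefix + fixed part of the header
    client_id = None
    if client_id_len >= 0:  # a length of -1 encodes a null client id
        client_id = frame[offset:offset + client_id_len].decode("utf-8")
        offset += client_id_len
    return {
        "size": size,
        "api_key": api_key,
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
        "body": frame[offset:4 + size],
    }


def build_request(api_key, api_version, correlation_id, client_id, body=b""):
    """Encode a frame the same way, used here only to exercise the parser."""
    cid = client_id.encode("utf-8")
    header = struct.pack(">hhih", api_key, api_version, correlation_id, len(cid)) + cid
    return struct.pack(">i", len(header) + len(body)) + header + body


# api_key 3 is the Metadata request in the Kafka protocol.
frame = build_request(api_key=3, api_version=1, correlation_id=7, client_id="demo")
print(parse_request_header(frame))
```

A proxy sitting at this layer can inspect `api_key` and `api_version` on every frame before deciding how to translate or forward the request.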
Kafka-proxy, OVHcloud version
Upsides:
● Working at the TCP layer enables performance
● Nice PoC to discover both protocols
● Rust is blazing fast
● Proxying production is easy
● We could bump old versions of Kafka frames for old Kafka clients
Downsides:
● Everything has to be rewritten
● Some things were hard to proxy:
○ Group coordinator
○ Offset management
● Difficult to open-source (different language)
And then we saw this 😍
Apache Pulsar's protocol handler
https://www.ovh.com/blog/announcing-kafka-on-pulsar-bring-native-kafka-protocol-support-to-apache-pulsar/
Apache Pulsar’s TL;DR
❏ What Pulsar Provides
✓ Multi-Tenancy
✓ Security
✓ TLS Encryption
✓ Authentication, Authorization
✓ Geo-replication
✓ Queuing and streaming semantics
✓ Tiered storage
✓ Schema
✓ Integrations with big data ecosystem (Flink / Spark / Presto)
✓ Additional ecosystems
✓ Kafka
Thanks!
Do you have questions?
Slides: https://pierrezemb.fr
Twitter: PierreZ
Github: PierreZ
Bonus 😍
Our deployment
Bookkeeper
STOR-2
● Intel Xeon-D 1541
● 32GB DDR4 ECC
● 4x HDD 12TB
● 2x SSD 240GB
Pulsar
ADVANCE-4
● AMD Epyc 7351P
● 128GB DDR4 ECC
● 2x SSD NVMe RAID
Our ideas for the future
● Reopen the admin API
○ Will allow users to easily use features like
■ schema, tiered storage, geo-replication, ...
■ ordering topics from code, ...
● Upgrade the cluster
● Deploy Presto, KoP and WebSockets
● More protocols!
● Add encryption at the Bookkeeper layer
● Create "managed-topics"
○ a special namespace with topics populated by OVHcloud
○ See events/logs from other products
