How Orange Financial combats financial fraud
over 50M transactions a day using
Apache Pulsar
Vincent Xie (Bestpay), Jia Zhai (StreamNative)
About us
Vincent (Weisheng) Xie
❏ Current Director @ Orange Financial
❏ Previously tech lead of the ML engineering team @ Intel
Jia Zhai
❏ Co-Founder of StreamNative
❏ Apache Pulsar PMC Member
❏ Apache BookKeeper PMC Member
Agenda
❏ Background
❏ Apache Pulsar
❏ Unified Data Processing
❏ Our Practices
❏ Q & A
Background Intro
Orange Financial
Orange Financial Services Group (Chinese: 甜橙金融), formerly known as Bestpay, is an affiliate company of
China Telecom. It reached 1.13 trillion CNY ($18.37 Billion) transaction volume in 2018, with 500 million registered
users and 41.9 million active users.
Subsidiaries:
Bestpay - a mobile wallet and payment app
Jieqian - a consumer loan service
Orange Wealth
Orange Insurance
Orange Credit
Orange Financial Cloud
Source: iiMedia Research Inc.
High Industry Penetration Rate
Source: China Unionpay
Source: RSA
Challenges
❏ High concurrency
❏ > 50M transactions, 1 billion events a day (peak: 35K/s)
❏ Low latency demand
❏ response < 200ms
❏ Large number of batch jobs and streaming jobs
“A merchant’s total transaction volume ($) within the past month (30 days)
(current transaction included)”
= sum($past_29days) [batch] + sum($today_upto_current) [streaming]
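The split above can be sketched in a few lines of Python. This is only an illustration of the formula, not the production fraud-detection logic: the function name, the shape of `daily_totals` (a precomputed per-day total, as a batch job might produce) and `today_amounts` (amounts seen so far today, as a streaming job might track) are all assumptions.

```python
from datetime import date, timedelta

def rolling_30day_volume(daily_totals, today_amounts, today):
    """Merchant volume over the past 30 days, current transaction included.

    daily_totals:  dict mapping date -> total for that full day (batch side).
    today_amounts: amounts seen so far today, including the current
                   transaction (streaming side).
    """
    # Batch part: the 29 complete days before today.
    batch_part = sum(
        daily_totals.get(today - timedelta(days=d), 0) for d in range(1, 30)
    )
    # Streaming part: everything that has arrived today up to now.
    streaming_part = sum(today_amounts)
    return batch_part + streaming_part

# Hypothetical data: 100 per day for the previous 40 days, two events today.
today = date(2019, 6, 30)
history = {today - timedelta(days=d): 100 for d in range(1, 41)}
volume = rolling_30day_volume(history, [30, 20], today)
```

Only the 29 most recent full days count toward the batch part, so older entries in `history` are ignored.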
Architecture V1 - Lambda
API Gateway
Batch Layer
Speed/Streaming Layer
Serving Layer
Drawbacks
❏ Software stack complexity
❏ Realtime / Offline / Serving stacks
❏ Multiple clusters to maintain (Kafka / Hive / Spark / Flink)
❏ Different skill sets required (Scala / Java / SQL)
❏ Segmented logic
❏ Historical/Current
❏ Data redundancy
❏ Multiple copies of the data to move around
Introducing Apache Pulsar
What is Apache Pulsar?
“Flexible Pub/Sub Messaging
Backed by durable log storage”
Pulsar - A cloud-native architecture
Stateless Serving
Durable Storage
Pulsar - Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
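The storage model in these bullets can be illustrated with a toy Python model. It is not how Pulsar or BookKeeper is implemented: the class names, the entry-count rollover rule and the round-robin bookie selection are simplifying assumptions chosen to show the idea of a partition as a sequence of sealed, replicated, append-only segments.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """Toy stand-in for a ledger: append-only, written by one writer."""
    segment_id: int
    bookies: tuple          # storage nodes holding a replica of this segment
    entries: list = field(default_factory=list)
    closed: bool = False

    def append(self, payload):
        if self.closed:
            raise ValueError("segment is sealed; no further appends")
        self.entries.append(payload)
        return len(self.entries) - 1  # entry id within the segment

class TopicPartition:
    """Toy stand-in for a managed ledger: an ordered list of segments."""

    def __init__(self, bookies, replicas=2, max_entries=3):
        if replicas > len(bookies):
            raise ValueError("not enough bookies for the requested replicas")
        self.bookies = list(bookies)
        self.replicas = replicas
        self.max_entries = max_entries  # assumed rollover rule
        self.segments = []
        self._roll()

    def _roll(self):
        # Seal the current segment and open a new one on another bookie set.
        if self.segments:
            self.segments[-1].closed = True
        seg_id = len(self.segments)
        chosen = tuple(
            self.bookies[(seg_id + i) % len(self.bookies)]
            for i in range(self.replicas)
        )
        self.segments.append(Segment(seg_id, chosen))

    def publish(self, payload):
        if len(self.segments[-1].entries) >= self.max_entries:
            self._roll()
        seg = self.segments[-1]
        return (seg.segment_id, seg.append(payload))

# Hypothetical bookie names; seven messages span three segments.
partition = TopicPartition(["bookie-1", "bookie-2", "bookie-3"])
ids = [partition.publish(f"m{i}") for i in range(7)]
```

Because each segment picks its own set of bookies, the data of one partition ends up spread across the storage nodes rather than pinned to a single one.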
Pulsar - Pub/Sub
Pulsar - Topic Partitions
Pulsar - Segments
Pulsar - Stream
Pulsar - Stream as a unified view on data
Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
❏ Read data from brokers
❏ Consume / Seek / Receive
❏ Subscription Mode - Failover, Shared, Key_Shared
❏ Reprocessing data by rewinding (seeking) the cursors
❏ Segment (Batch)
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)
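The difference between the two reading levels can be sketched with a toy model. None of this is the Pulsar client API: `Message`, `Cursor` and `read_segments` are hypothetical names, and the sketch assumes messages are ordered by publish time. The cursor shows reprocessing by seeking; the segment reader shows how a publish-timestamp predicate lets whole segments be skipped.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    publish_time: int
    payload: str

class Cursor:
    """Pub/Sub-style reading: one position that moves through the stream."""

    def __init__(self, messages):
        self._messages = messages
        self._pos = 0

    def receive(self):
        if self._pos >= len(self._messages):
            return None
        msg = self._messages[self._pos]
        self._pos += 1
        return msg

    def seek(self, publish_time):
        # Move to the first message published at or after publish_time.
        self._pos = next(
            (i for i, m in enumerate(self._messages)
             if m.publish_time >= publish_time),
            len(self._messages),
        )

def read_segments(segments, start, end):
    """Segment-style reading over [start, end) with timestamp pruning.

    Returns the matching messages and how many segments were scanned.
    """
    matched, scanned = [], 0
    for seg in segments:
        # Skip segments whose time range cannot contain a match.
        if not seg or seg[-1].publish_time < start or seg[0].publish_time >= end:
            continue
        scanned += 1
        matched.extend(m for m in seg if start <= m.publish_time < end)
    return matched, scanned

# Three segments covering publish times 0-2, 3-5 and 6-8.
segments = [[Message(t, f"m{t}") for t in range(s, s + 3)] for s in (0, 3, 6)]
cursor = Cursor([m for seg in segments for m in seg])
first_three = [cursor.receive().payload for _ in range(3)]
cursor.seek(1)                      # rewind to reprocess from time 1
replayed = cursor.receive().payload
batch, scanned = read_segments(segments, 4, 6)
```

In the segment read, each surviving segment could be handed to a separate worker, which is where the fine-grained parallelism comes from.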
Unified Data Processing on Pulsar
Architecture V2
API Gateway
Spark Structured Streaming
Spark SQL
❏ Single Data Store (Pulsar)
❏ Single Computing Engine (Spark)
❏ Unified API
Pulsar-Spark
❏ Deeply integrated with Pulsar schema
❏ Pulsar topics as Structured Streams
❏ Pulsar Connectors for Spark Structured Streaming
❏ Pulsar Connectors for Spark SQL
https://github.com/streamnative/pulsar-spark
Pulsar-Spark / Streaming Queries
https://github.com/streamnative/pulsar-spark
Pulsar-Spark / Batch Queries
https://github.com/streamnative/pulsar-spark
Pulsar-Spark / Write Results to Pulsar
https://github.com/streamnative/pulsar-spark
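The code shown on these three slides did not survive in the transcript. As a rough PySpark sketch of what such queries can look like: the `pulsar` format name and the `service.url`, `admin.url` and `topic` options follow the pulsar-spark README as we understand it and may differ between connector versions. The helper names, the URLs and any topic name are hypothetical, and running the queries requires a Spark session with the connector on its classpath.

```python
def pulsar_options(topic,
                   service_url="pulsar://localhost:6650",
                   admin_url="http://localhost:8080"):
    """Connector options; the URLs are placeholder assumptions."""
    return {
        "service.url": service_url,
        "admin.url": admin_url,
        "topic": topic,
    }

def streaming_query(spark, topic):
    # Streaming query: treat the topic as an unbounded Structured Stream.
    return spark.readStream.format("pulsar").options(**pulsar_options(topic)).load()

def batch_query(spark, topic):
    # Batch query: read the data currently in the topic as a DataFrame.
    return spark.read.format("pulsar").options(**pulsar_options(topic)).load()

def write_results(df, topic):
    # Write the rows of a batch DataFrame back to a Pulsar topic.
    df.write.format("pulsar").options(**pulsar_options(topic)).save()
```

The same options drive both the streaming and the batch read, which is the point of the unified API: only `readStream` versus `read` changes.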
PoC at Bestpay
❏ Ingest data to Pulsar
❏ Realtime Data
❏ pulsar-io-kafka: read Kafka messages (JSON) into Pulsar
and store them in Avro format with schema information
❏ Historical Data
❏ pulsar-spark: query the Hive table and write its rows to
Pulsar as Avro messages
❏ Data Processing
❏ Spark Structured Streaming: for stream processing
❏ Spark SQL: for batch processing and interactive queries
Benefits
❏ Complexity reduced by 33% (number of clusters down from 6 to 4)
❏ Storage savings of 8.7% (expected to reach 28%)
❏ Time to production 11x faster (backed by streaming SQL)
❏ Higher stability (expected)
Summary
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi-layered architecture
❏ Segment-centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Pulsar + Spark for simple, unified data processing
References
❏ pulsar-io-kafka: https://github.com/streamnative/pulsar-io-kafka
❏ pulsar-spark: https://github.com/streamnative/pulsar-spark
❏ Apache Pulsar as One Storage System for Both Real-time and Historical Data Analysis:
https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ Github
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative
Thanks!
