Feature Stores
Building ML Infrastructure on Apache Pulsar
Simba Khadder
Co-Founder & CEO
StreamSQL.io
Using Apache Pulsar to power our feature store for
>100M MAU
Agenda
● The ML process
● Moving our ML Pipelines w/ Pulsar
● Building a Feature Store on top of Pulsar
● Q&A
Last 5 articles read
Current article
Top Genre
Average Content Length
Diversity of reading tastes
....
Recommend Next
Article
Input Features Model Output
Machine Learning :: Model(Features) = Output
Feature Engineering > Model Research*
Last 5 articles read
Current article
Top Category
Total time spent reading
Diversity of reading tastes
....
Recommend Next
Article
Input Features Model Output
Behind every great model is a set of great features
Feature Stores: Building Machine Learning Infrastructure on Apache Pulsar_Simba khadder
Credit: Microsoft Azure Sales deck
Our ML teams spent >80% of their time
building and maintaining ML pipelines for
feature generation and feature
engineering.
The Feature Engineering Cycle
1. Hypothesis: new feature
2. Generate training dataset with new feature
3. Validate new feature increases performance
4. Deploy feature to production
Training Data
Online Features Serving
Train
User ID
Feature Set
Arr([FeatureSet, Actual])
Last 5 articles read
Current article
Top Category
Total time spent reading
Diversity of reading tastes
....
Recommend Next
Article
Input Features Model Output
Generating Point-in-Time Correct Training Data
[Diagram: a timeline of "read" events, each paired with the features at its timestamp. Flow: Events, Storage, Training Data]
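A minimal sketch of point-in-time correct training data, assuming a single stateful feature (total time spent reading) and in-memory events. The names `ReadEvent`, `total_readtime_as_of`, and `build_training_rows` are hypothetical and not part of any library. Each training row only sees events that happened strictly before its own timestamp, so no future information leaks into the features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEvent:
    user: str
    timestamp: int  # hypothetical unit: seconds since some epoch
    readtime: int   # seconds spent reading the article


def total_readtime_as_of(events, user, timestamp):
    """Feature value built only from events strictly before `timestamp`."""
    return sum(
        e.readtime
        for e in events
        if e.user == user and e.timestamp < timestamp
    )


def build_training_rows(events, labels):
    """Pair each label with the features as they were at the label's timestamp.

    `labels` is a list of (user, timestamp, actual) tuples; the result is a
    list of (feature_set, actual) pairs.
    """
    rows = []
    for user, timestamp, actual in labels:
        feature_set = {
            "total_readtime": total_readtime_as_of(events, user, timestamp)
        }
        rows.append((feature_set, actual))
    return rows
```

A label at time 25 for a user with reads at times 10, 20, and 30 would get only the first two reads in its feature value. The linear scan per label is for clarity; a real system would index events by user and time.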
Generating Features for Serving (in a perfect world)
[Diagram: a timeline of "read" events, each paired with the features at its timestamp. Flow: Event Stream, Processor, Online Features]
Last 5 articles read
Current article
Top Category
Total time spent reading
Diversity of reading tastes
....
Recommend Next
Article
Input Features Model Output
Most Features are Stateful
Total time spent reading
Input Features Model Output
Stateful Features must be Bootstrapped
Bootstrapping Stateful Features with
Historical Data in S3
SELECT user, SUM(readtime) FROM read_events GROUP BY user;
[Diagram: a time axis split at a MsgID. Earlier events are persisted in S3; later events are not in S3, but are in Kafka (retention period). Finish bootstrapping and start stream processing from Kafka.]
SELECT user, SUM(readtime) FROM read_events GROUP BY user;
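The bootstrapping flow above can be sketched as follows. This is an illustration under simplifying assumptions, not the talk's implementation: events are `(msg_id, user, readtime)` tuples, and message IDs are modeled as a single increasing integer, which is simpler than real Kafka offsets (per partition) or Pulsar message IDs. The functions `bootstrap` and `apply_stream` are hypothetical names.

```python
from collections import defaultdict


def bootstrap(historical_events):
    """Batch step: the equivalent of SUM(readtime) GROUP BY user.

    Also returns the highest message ID covered, so the stream step
    knows where to take over.
    """
    totals = defaultdict(int)
    last_msg_id = -1
    for msg_id, user, readtime in historical_events:
        totals[user] += readtime
        last_msg_id = max(last_msg_id, msg_id)
    return totals, last_msg_id


def apply_stream(totals, last_msg_id, stream_events):
    """Stream step: continue the same aggregation after the bootstrap point."""
    for msg_id, user, readtime in stream_events:
        if msg_id <= last_msg_id:
            continue  # already counted during bootstrapping
        totals[user] += readtime
        last_msg_id = msg_id
    return totals, last_msg_id
```

Skipping messages at or before the recorded ID is what keeps the overlap between the historical store and the stream's retention window from being counted twice.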
Full Feature Deployment Process
Combine Batch & Stream Processing with an Immutable Ledger
● Each new event appends to the end of the ledger
● Cut at an arbitrary point, and the ledger looks like a
batch problem
● Only read from the head of the ledger, and it looks
like a streaming problem
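The three points above can be illustrated with a toy in-memory ledger. This is a sketch only: `Ledger`, `batch`, `read_from`, and `sum_readtime` are hypothetical names, and a real ledger would be durable and distributed. The point is that one aggregation function serves both the bounded (batch) view and the tailing (stream) view.

```python
class Ledger:
    """Append-only, immutable log of events (toy, in-memory)."""

    def __init__(self):
        self._entries = []

    def __len__(self):
        return len(self._entries)

    def append(self, event):
        """Add an event at the end and return its position."""
        self._entries.append(event)
        return len(self._entries) - 1

    def batch(self, cut):
        """Bounded view: everything before position `cut`."""
        return list(self._entries[:cut])

    def read_from(self, start):
        """Streaming view: events from position `start` to the current end."""
        return iter(self._entries[start:])


def sum_readtime(totals, events):
    """Fold (user, readtime) events into running per-user totals."""
    for user, readtime in events:
        totals[user] = totals.get(user, 0) + readtime
    return totals
```

Folding the batch view up to a cut and then folding the events read from that cut gives the same totals as folding the whole ledger, which is why a single feature definition can cover both modes.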
Pulsar Based Architecture with Infinite Retention
Pulsar’s offloading makes Event-Sourcing achievable
Pulsar’s Tiered Architecture enhances Processing on
Infinite Retention
Features are the building blocks of
ML models; however, they are
developed and maintained in
ad-hoc ways.
They lack a dedicated system of
management.
ML Pipelines < Feature Stores
● No concrete feature definitions, feature logic is split
across Flink jobs.
● No feature versioning and rollback.
● No feature sharing, re-use, and discovery.
● No integrations into TensorFlow, Jupyter, etc.
A platform for features allows teams to work together.
Features are easily defined, shared, and re-used.
There exists a single source of truth for features.
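As a rough illustration of what a single source of truth with versioning and rollback could look like, here is a toy registry. It is an assumption-laden sketch, not StreamSQL's API: `FeatureRegistry` and its methods are hypothetical, and feature logic is modeled as a plain Python function over a list of read times.

```python
class FeatureRegistry:
    """Toy registry: named feature definitions with versions and rollback."""

    def __init__(self):
        self._versions = {}  # feature name -> list of definitions
        self._active = {}    # feature name -> index of the active version

    def register(self, name, fn):
        """Add a new version of a feature and make it the active one."""
        versions = self._versions.setdefault(name, [])
        versions.append(fn)
        self._active[name] = len(versions) - 1
        return self._active[name]

    def rollback(self, name, version):
        """Make an earlier version of a feature active again."""
        if not 0 <= version < len(self._versions[name]):
            raise ValueError(f"unknown version {version} for feature {name!r}")
        self._active[name] = version

    def compute(self, name, events):
        """Evaluate the active definition; training and serving both call this."""
        return self._versions[name][self._active[name]](events)

    def list_features(self):
        """Discovery: the names of all registered features."""
        return sorted(self._versions)
```

Because training and serving go through the same `compute` call, they cannot drift apart the way two separately maintained pipeline jobs can.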
Last 5 articles read
Current article
Top Category
Total time spent reading
Diversity of reading tastes
....
Recommend Next
Article
Input Features Model Output
Models across an organization may benefit from
some of these features.
StreamSQL.io
accelerates and
enhances machine
learning development
● Facilitate model development
Discover, re-use, and share
features across teams and models.
● Deploy with confidence
Use a single feature definition for
training and serving.
● Limit complexity
Unified streaming and batch
processing for feature generation.
● Increase model performance
Use third-party features, from text
embeddings to weather data.
Fraud Detection Example
1. Connect and Upload Data
2. Transform and Join Data
3. Define and Serve Features
4. Generate Training Data
[Diagram: a timeline of labels, each paired with the features at its timestamp]
The Feature Engineering Cycle
1. Hypothesis: new feature
2. Generate training dataset with new feature
3. Validate new feature increases performance
4. Deploy feature to production
StreamSQL
The Feature Store for Machine Learning
Beta
Simba Khadder
simba@streamsql.io