Building Identity Graphs over Heterogeneous Data
Sudha Viswanathan, Saigopal Thota – Primary Contributors
Agenda
▪ Identities at Scale
▪ Why Graph
▪ How we Built and Scaled it
▪ Why an In-house solution
▪ Challenges and Considerations
▪ A peek into Real time Graph
Identities at Scale
[Diagram: Identity Tokens: Account Ids, Cookies, Online Ids, and Device Ids flowing in from apps (App 1, App 2, App 3), online channels (Online 1, Online 2, ...), and partner apps]
Identity Resolution
Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries. Provides a single notion of a customer.
Why Graph
Identities, Linkages & Metadata – An Example
[Diagram: identities (Login id, App id, Device id, Cookie id) connected by linkages, with metadata attached to nodes and edges, e.g. Last login: 4/28/2020, App: YouTube, Country: Canada]
Graph – An Example
[Diagram: the same identities (Login id, App id, Device id, Cookie id) merged into a single graph, with node and edge metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
Connect all Linkages to create a single connected component per user/household
Graph Traversal
▪ A graph is an efficient data structure relative to table joins
▪ Why table joins don't work:
▪ Linkages are in the order of millions of rows spanning hundreds of tables
▪ Table joins are index-based and computationally very expensive
▪ Table joins result in lower coverage
Scalable and offers better coverage
Build once – Query multiple times
▪ Graph enables dynamic traversal logic. One graph offers infinite traversal possibilities:
▪ Get all tokens linked to an entity
▪ Get all tokens linked to the entity's household
▪ Get all tokens linked to an entity that were created after Jan 2020
▪ Get all tokens linked to the entity's household that interacted using App 1
▪ ...
Graph comes with flexibility in traversal (see the sketch below)
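To make "build once, query multiple times" concrete, below is a minimal sketch (not the production traversal engine) of rule-driven traversal over a small in-memory adjacency list; the token ids, node attributes, and edge attributes are hypothetical.

```python
from collections import deque

# Hypothetical identity graph: adjacency list with per-edge metadata plus
# per-node attributes. The graph is built once; only the rules change.
edges = {
    "login_1":  [("device_1", {"app": "App 1"}), ("cookie_1", {"app": "App 2"})],
    "device_1": [("login_1",  {"app": "App 1"})],
    "cookie_1": [("login_1",  {"app": "App 2"})],
}
nodes = {
    "login_1":  {"type": "login_id",  "created": "2019-11-02"},
    "device_1": {"type": "device_id", "created": "2020-02-14"},
    "cookie_1": {"type": "cookie_id", "created": "2020-03-30"},
}

def traverse(start, node_ok=lambda n: True, edge_ok=lambda e: True):
    """Collect every token reachable from `start` through qualifying edges."""
    seen, queue, result = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for neighbor, meta in edges.get(current, []):
            if neighbor in seen or not edge_ok(meta):
                continue
            seen.add(neighbor)
            queue.append(neighbor)
            if node_ok(nodes[neighbor]):
                result.append(neighbor)
    return result

# Same graph, different traversal rules:
all_tokens    = traverse("login_1")
recent_tokens = traverse("login_1", node_ok=lambda n: n["created"] >= "2020-01-01")
app1_tokens   = traverse("login_1", edge_ok=lambda e: e["app"] == "App 1")
```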
How we Built and Scaled
Scale and Performance Objectives
▪ 25+ billion linkages and identities
▪ New linkages created 24x7
▪ Node and edge metadata updated for 60% of existing linkages
▪ Freshness – graph updated with linkages and metadata once a day
▪ Could be a few hours in future goals
▪ Ability to run on general-purpose Hadoop infrastructure
Components of Identity Graph
▪ Data Analysis – understand your data and check for anomalies
▪ Handling Heterogeneous Data Sources – extract only new and modified linkages in the format needed by the next stage
▪ Core Processing
▪ Stage I – Dedup & Eliminate Outliers: add edge metadata, filter outliers, and populate tables needed by the next stage
▪ Stage II – Create Connected Components: merge linkages to form an identity graph for each customer
▪ Stage III – Prepare for Traversal: demystifies linkages within a cluster and appends metadata information to enable graph traversal
▪ Traversal – traverse the cluster as per defined rules to pick only the qualified nodes
Data Analysis
▪ Understanding the data that feeds into the Graph pipeline is paramount to building a usable Graph framework.
▪ Feeding poor-quality linkages results in connected components spanning millions of nodes, taking a toll on computing resources and business value
▪ Some questions to analyze (see the profiling sketch below):
▪ Does the linkage relationship make business sense?
▪ What is an acceptable threshold for poor-quality linkages?
▪ Do we need to apply any filters?
▪ Nature of data – snapshot vs. incremental
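As an illustration of this analysis, the sketch below profiles how many entities map to each token; the `entity_id`/`token_id` column names and the input path are hypothetical, but this is the distribution the outlier threshold is later derived from.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("linkage-profiling").getOrCreate()

# Hypothetical input: one row per (entity_id, token_id) linkage.
linkages = spark.read.parquet("/data/raw_linkages")

# Entities per token: the long tail of this distribution is where anomalies hide.
cardinality = linkages.groupBy("token_id").agg(
    F.countDistinct("entity_id").alias("entities_per_token"))

# Histogram of the cardinality, e.g. most tokens map to 1-5 entities,
# while a handful map to millions.
(cardinality.groupBy("entities_per_token")
            .count()
            .orderBy("entities_per_token")
            .show(50, truncate=False))

# Rough percentiles help pick a candidate filtering threshold.
print(cardinality.approxQuantile("entities_per_token", [0.5, 0.99, 0.9999], 0.001))
```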
Handling Heterogeneous Data Sources
▪ Data sources grow rapidly in volume and variety
▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls!
▪ Dedicated framework to ingest data in parallel from heterogeneous sources
▪ Serves only new and modified linkages. This is important for incremental processing
▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
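A minimal sketch of that ingestion contract, assuming hypothetical source paths, column names, and a stored watermark: each source is sliced to new or modified rows and projected into one standard linkage schema for the next stage.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("linkage-ingest").getOrCreate()

last_watermark = "2020-04-27 00:00:00"   # would come from pipeline state in practice

def extract_linkages(path, src_col, dst_col, link_type, ts_col):
    """Pull only new/modified linkages from one source into the standard schema."""
    return (spark.read.parquet(path)
            .where(F.col(ts_col) > F.lit(last_watermark))          # incremental slice
            .select(F.col(src_col).alias("tid1"),
                    F.col(dst_col).alias("tid2"),
                    F.lit(link_type).alias("link_type"),
                    F.col(ts_col).alias("updated_at")))

# Heterogeneous sources, one output schema.
sources = [
    ("/data/app_logins",   "login_id",  "device_id", "login-device", "event_ts"),
    ("/data/web_sessions", "cookie_id", "login_id",  "cookie-login", "updated_at"),
]
standardized = None
for args in sources:
    part = extract_linkages(*args)
    standardized = part if standardized is None else standardized.unionByName(part)

standardized.write.mode("overwrite").parquet("/data/standard_linkages")
```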
Core Processing – Stage I
▪ Feeds good-quality linkages to further processing (a sketch follows this list).
▪ It handles:
▪ Deduplication
▪ If a linkage is repeated, we consume only the latest record
▪ Outlier elimination
▪ Filters anomalous linkages based on a chosen threshold derived from data analysis
▪ Edge metadata population
▪ Attributes of the linkage that help traverse the graph to the desired linkages
Dedup & Eliminate Outliers
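A minimal sketch of Stage I on the hypothetical standard schema from the previous sketch: keep only the latest record per repeated linkage, then drop linkages whose token fan-out exceeds the threshold chosen during data analysis, carrying the edge metadata along.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("stage1-dedup").getOrCreate()
linkages = spark.read.parquet("/data/standard_linkages")   # tid1, tid2, link_type, updated_at

CARDINALITY_THRESHOLD = 1000   # illustrative; derived from the data analysis step

# Deduplication: if a linkage is repeated, consume only the latest record.
latest_first = (Window.partitionBy("tid1", "tid2", "link_type")
                      .orderBy(F.col("updated_at").desc()))
deduped = (linkages.withColumn("rn", F.row_number().over(latest_first))
                   .where("rn = 1")
                   .drop("rn"))

# Outlier elimination: drop linkages whose token is shared by too many entities.
fanout = deduped.groupBy("tid2").agg(F.countDistinct("tid1").alias("fanout"))
clean = (deduped.join(fanout, "tid2")
                .where(F.col("fanout") <= CARDINALITY_THRESHOLD)
                .drop("fanout"))

# Edge metadata (link_type, updated_at) rides along for traversal later.
clean.write.mode("overwrite").parquet("/data/stage1_linkages")
```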
Core Processing – Stage II
▪ Merges all related linkages of a customer to create a Connected Component
Create Connected Components
Core Processing – Stage III
▪ This stage enriches the connected component with linkages between nodes and edge metadata to enable graph traversal (a sketch follows the worked example below).
Prepare for Traversal
[Worked example: Login id, App id, Device id, and Cookie id nodes with metadata (Last login: 4/28/2020, App: YouTube, Country: Canada). Stage I deduplicates the edge lists (A-B, B-C, B-D, D-E and P-N, N-M) and eliminates outliers; Stage II merges them into connected components G1 = {A, B, C, D, E} and G2 = {P, N, M}; Stage III re-attaches the edge metadata (m1, m2, m3, m4 for G1; m1, m2 for G2) to prepare each component for traversal]
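A minimal sketch of the Stage III output under the same hypothetical schema: once Stage II has tagged every token with its component id (tgid), the edges are turned into per-token adjacency lists with edge metadata, which is the structure the traversal later reads.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stage3-adjacency").getOrCreate()

edges = spark.read.parquet("/data/stage1_linkages")     # tid1, tid2, link_type
membership = spark.read.parquet("/data/tid_to_tgid")    # tid, tgid (Stage II output)

# Emit each edge in both directions so every token sees all of its neighbors.
directed = (edges.select("tid1", "tid2", "link_type")
                 .union(edges.select(F.col("tid2").alias("tid1"),
                                     F.col("tid1").alias("tid2"),
                                     "link_type")))

# Per-token adjacency list, grouped under the token's connected component.
adjacency = (directed.join(membership, directed.tid1 == membership.tid)
                     .groupBy("tgid", "tid1")
                     .agg(F.collect_list(F.struct("tid2", "link_type"))
                           .alias("neighbors")))

adjacency.write.mode("overwrite").parquet("/data/stage3_adjacency")
```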
Union Find Shuffle (UFS): Building Connected Components at Scale
Weighted Union Find with Path Compression
[Diagram: worked example of weighted union find on nodes 1, 2, 5, 7, 8, 9. Weighted union attaches the smaller cluster under the top-level parent of the larger one, keeping the tree height at 1 where a non-weighted union would give height 2; path compression then flattens the merged cluster into a single tree of size 5 with node 2 as the top-level parent]
• Find() – finds the top-level parent. If a is the child of b and b is the child of c, then find() determines that c is the top-level parent: a -> b; b -> c => c is the top-level parent
• Path compression – reduces the height of connected components by linking all children directly to the top-level parent: a -> b -> c becomes a -> c; b -> c
• Weighted union – unifies top-level parents. The parent with fewer children is made the child of the parent with more children, which also helps to reduce the height of connected components.
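A minimal single-machine sketch of these three operations; the class name and API are illustrative rather than the production implementation.

```python
class UnionFind:
    """Weighted union find with path compression over arbitrary hashable ids."""

    def __init__(self):
        self.parent = {}   # node -> parent; roots point to themselves
        self.size = {}     # root -> size of its cluster

    def find(self, x):
        """Return the top-level parent of x, compressing the path along the way."""
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:              # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Weighted union: the smaller cluster's root becomes a child of the larger's."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

# a and c end up under one top-level parent, and the chain is flattened.
uf = UnionFind()
uf.union("a", "b"); uf.union("b", "c")
assert uf.find("a") == uf.find("c")
```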
Distributed UFS with Path Compression
Divide: Divide the data into partitions
Run UF: Run weighted union find with path compression on each partition
Shuffle: Merge locally processed partitions with a global shuffle, iteratively, until all connected components are resolved
Path Compress: Iteratively perform path compression on the connected components until all of them are path compressed
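A heavily simplified sketch of that loop on Spark, reusing the UnionFind class from the previous sketch: each partition runs a local union find, the shuffle groups every node's candidate parents, disagreements become new pairs for the next round, and the loop terminates when every node agrees on a single top-level parent. The real UFS implementation adds caching, checkpointing, and a proper path-compression pass, and this sketch may need more iterations than the optimized algorithm.

```python
from pyspark.sql import SparkSession

def local_union_find(pair_iter):
    """Run weighted union find with path compression on one partition's pairs."""
    uf = UnionFind()                                  # class from the previous sketch
    for a, b in pair_iter:
        uf.union(a, b)
    return [(node, uf.find(node)) for node in list(uf.parent)]

def shuffle_round(assignments):
    """Group each node's candidate parents and relink them all to the smallest one."""
    def relink(node_and_parents):
        node, parents = node_and_parents
        canonical = min(set(parents) | {node})
        yield (node, canonical)
        for p in set(parents):
            if p != canonical:
                yield (p, canonical)                  # disagreement becomes a new pair
    return assignments.groupByKey().flatMap(relink)

spark = SparkSession.builder.appName("ufs-sketch").getOrCreate()
pairs = spark.sparkContext.parallelize(
    [("a", "b"), ("b", "c"), ("d", "e")], numSlices=4)

for _ in range(10):                                   # iteration cap for the sketch
    assignments = pairs.mapPartitions(local_union_find).cache()
    conflicts = (assignments.groupByKey()
                            .filter(lambda kv: len(set(kv[1])) > 1))
    if conflicts.isEmpty():                           # termination: no node disagrees
        break
    pairs = shuffle_round(assignments)

components = dict(assignments.collect())              # node -> component representative
```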
Shuffle in UFS
[Diagram: numeric example of the shuffle step: locally found (node, parent) pairs from different partitions are grouped and merged; the loop either proceeds to the next iteration or stops once the termination condition is reached]
Union Find Shuffle using Spark
▪ Sheer scale of data at hand ( 25+ Billion vertices & 30+ Billion edges)
▪ Iterative processing with caching and intermittent checkpointing
▪ Limitations with other alternatives
How do we scale?
▪ The input to Union Find Shuffle is bucketed to create 1000 part files of similar size
▪ 10 instances of Union Find execute on the 1000 part files with ~30 billion nodes; each instance of UF is applied to 100 part files
▪ At any given time, we will have 5 instances of UF running in parallel
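An illustrative sketch of that setup (paths, column names, and the scheduling helper are hypothetical): hash-partition the deduped linkages into ~1000 similarly sized part files, then schedule ten union-find instances of 100 files each, with at most five in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ufs-bucketing").getOrCreate()
linkages = spark.read.parquet("/data/stage1_linkages")      # tid1, tid2, ...

NUM_BUCKETS, FILES_PER_INSTANCE, MAX_PARALLEL = 1000, 100, 5

# Hash-partition on the token id into ~1000 similarly sized part files.
(linkages.repartition(NUM_BUCKETS, "tid1")
         .write.mode("overwrite")
         .parquet("/data/ufs_input"))

def run_union_find_instance(first_bucket):
    """Placeholder for launching one UF instance over its 100 part files."""
    last_bucket = first_bucket + FILES_PER_INSTANCE - 1
    print(f"running union find over part files {first_bucket}..{last_bucket}")

# 10 instances x 100 part files, with at most 5 instances running in parallel.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    list(pool.map(run_union_find_instance, range(0, NUM_BUCKETS, FILES_PER_INSTANCE)))
```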
Data Quality: Challenges and Considerations
Noisy Data
[Diagram: noisy linkage patterns: a single account (Acc 1) linked to cookie tokens Coo 1 through Coo 100, and a single cookie token (Coo 1) linked to accounts Acc 1 through Acc 100]
Graph exposes noise, opportunities, and fragmentation in data
An Example of anomalous linkage data
▪ For some linkages, we have millions of entities mapping to the same token (id)
▪ In the data distribution, we see a majority of tokens mapped to 1-5 entities
▪ We also see a few tokens (potential outliers) mapped to millions of entities!
[Chart: Data Distribution of entities per token]
Removal of Anomalous Linkages
▪ Extensive analysis to identify anomalous linkage patterns
▪ A Gaussian anomaly detection model (statistical analysis)
▪ Identify thresholds of linkage cardinality to filter linkages
▪ A lenient threshold will improve coverage at the cost of precision
Strike the balance between Coverage and Precision
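A minimal sketch of that statistical step on the hypothetical schema used earlier: fit a Gaussian to the log of the entities-per-token cardinality (the distribution is heavily skewed) and treat tokens more than a chosen number of standard deviations above the mean as anomalous linkage candidates.

```python
import math
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("anomaly-threshold").getOrCreate()
linkages = spark.read.parquet("/data/standard_linkages")   # tid1 (entity), tid2 (token)

K_SIGMA = 3.0   # illustrative cut-off; tuned against the coverage/precision trade-off

# Entities per token, as in the data-analysis step.
cardinality = linkages.groupBy("tid2").agg(F.countDistinct("tid1").alias("fanout"))

# Gaussian model on log(fanout): mean and standard deviation define the cut-off.
stats = (cardinality.select(F.log("fanout").alias("log_fanout"))
                    .agg(F.mean("log_fanout").alias("mu"),
                         F.stddev("log_fanout").alias("sigma"))
                    .first())
threshold = math.exp(stats.mu + K_SIGMA * stats.sigma)

anomalous_tokens = cardinality.where(F.col("fanout") > threshold)
print(f"cardinality threshold ~ {threshold:.0f}, "
      f"{anomalous_tokens.count()} tokens flagged as anomalous")
```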
Comparing thresholds by number of big clusters, % match of facets, and number of distinct entities:
Threshold 10 (1 big cluster) – High Precision, Low Coverage
• The majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high
• More linkages would have been filtered out as dirty linkages, so the % match of facets will suffer
Threshold 1000 (4 big clusters) – Low Precision, High Coverage
• More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower
• Only a few linkages would have been filtered out as dirty linkages, so the % match of facets will be high
Large Connected Components (LCC)
▪ Sizes ranging from 10K to 100M+ nodes
▪ A combination of hubs and long chains
▪ Token collisions, noise in data, bot traffic
▪ Legitimate users also belong to LCCs
▪ Large number of shuffles in UFS
A result of a lenient threshold
Traversing LCC
▪ Business demands both Precision and Coverage, hence LCCs need traversal
▪ An iterative Spark BFS implementation is used to traverse LCCs
▪ Traversal is supported up to a certain pre-defined depth
▪ Going beyond a certain depth not only strains the system but also adds no business value
▪ Traversal is optimized to run using Spark in 20-30 minutes over all connected components
Solution to get both Precision and Coverage
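A simplified sketch of that traversal on DataFrames: the large component's edges (hypothetical columns src, dst, criteria) are filtered by the request criteria, and the frontier is expanded one hop per iteration up to a pre-defined depth, with caching as noted earlier (a production version would also checkpoint).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lcc-bfs").getOrCreate()

MAX_DEPTH = 6   # pre-defined depth cap; deeper hops add no business value

# Edges of a large connected component, filtered on the traversal criteria.
edges = (spark.read.parquet("/data/lcc_edges")              # src, dst, criteria
              .where(F.col("criteria").isin("m1", "m2")))

# Start nodes of the request (e.g. all A-type tokens).
frontier = (spark.read.parquet("/data/start_tokens")
                 .select(F.col("tid").alias("node")))
visited = frontier

for depth in range(MAX_DEPTH):
    # Expand one hop through qualifying edges.
    neighbors = (frontier.join(edges, frontier.node == edges.src)
                         .select(F.col("dst").alias("node"))
                         .distinct())
    # Keep only unseen nodes; stop early once the frontier empties out.
    frontier = neighbors.join(visited, "node", "left_anti").cache()
    if frontier.rdd.isEmpty():
        break
    visited = visited.union(frontier)

# `visited` holds every token reachable within MAX_DEPTH under the criteria.
visited.write.mode("overwrite").parquet("/data/traversal_result")
```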
Data Volume & Runtime
[Pipeline diagram: "Graph Pipeline", showing stages, data volumes, and approximate runtimes]
▪ Input: 15 upstream tables with 25B+ raw linkages and 30B+ nodes
▪ Handling Heterogeneous Linkages: table extraction and transformation into standard (tid1, tid2, linkage metadata) records, ~30 mins
▪ Stage I: dedup and outlier elimination, mapping tokens to long ids (tid1_long, tid2_long) with linkage metadata, ~1-2 hrs
▪ Stage II: Union Find Shuffle assigns a connected-component id (tgid) to every token, e.g. tid1_long -> tgid1, tid6_long -> tgid120, ~8-10 hrs
▪ Stage III: subgraph creation emits, per tgid, the token list and edges with metadata ({tgid: 1, tid: [aid, bid], edges: [srcid, destid, metadata]}) as well as per-token adjacency lists such as A1 -> [C1:m1, B1:m2, B2:m3], ~4-5 hrs
▪ Traversal on small connected components (SCC): answers requests such as "Give all aid-bid linkages which go via cid", ~20-30 mins
▪ Traversal on large connected components (LCC): for a request such as "Give all A-B linkages where criteria = m1, m2" (startnode = A, endnode = B), filter tids on m1, m2, count by tgid, dump components above ~5K nodes to an LCC table, then run one BFS map per tgid (unidirected or bidirected), ~20-30 mins
A peek into Real time Graph
▪ Linkages within streaming datasets
▪ New linkages require updating the graph in real time
▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc.
▪ Scale
▪ High volume – e.g., clickstream data: as users browse the webpage/app, new events get generated
▪ Replication and Consistency – making sure that the data is properly replicated for fault tolerance, and is consistent for queries
▪ Real-time Querying and Traversals
▪ High-throughput traversal and querying capability on tokens belonging to the same customer
Real time Graph: Challenges
Questions?