Avoiding Full GCs with
MemStore-Local Allocation Buffers
                 Todd Lipcon
              todd@cloudera.com
Twitter: @tlipcon      #hbase IRC: tlipcon




            February 22, 2011
Outline

  Background

  HBase and GC

  A solution

  Summary
Intro / who am I?
     Been working on data stuff for a few years
     HBase, HDFS, MR committer
     Cloudera engineer since March ’09
Motivation
     HBase users want to use large heaps
         Bigger block caches make for better hit rates
         Bigger memstores make for larger and more
         efficient flushes
         Machines come with 24G-48G RAM
     But bigger heaps mean longer GC pauses
         Around 10 seconds/GB on my boxes.
         Several minute GC pauses wreak havoc
GC Disasters
   1. Client requests stalled
           1 minute “latency” is just as bad as unavailability
   2. ZooKeeper sessions stop pinging
           The dreaded “Juliet Pause” scenario
   3. Triggers all kinds of other nasty bugs
Yo Concurrent Mark-and-Sweep (CMS)!

What part of Concurrent didn’t you understand?
Java GC Background
     Java’s GC is generational
         Generational hypothesis: most objects either die
         young or stick around for quite a long time
         Split the heap into two “generations” - young (aka
         new) and old (aka tenured)
     Use different algorithms for the two generations
     We usually recommend -XX:+UseParNewGC
     -XX:+UseConcMarkSweepGC (example below)
         Young generation: Parallel New collector
         Old generation: Concurrent-mark-sweep
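
     For example (a sketch - heap sizing is omitted here; it is usually set
     separately via HBASE_HEAPSIZE), these collectors are typically enabled
     through HBASE_OPTS in conf/hbase-env.sh:

         export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"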
The Parallel New collector in 60 seconds
     Divide the young generation into eden,
     survivor-0, and survivor-1
     One survivor space is from-space and the other
     is to-space
     Allocate all objects in eden
     When eden fills up, stop the world and copy
     live objects from eden and from-space into
     to-space, swap from and to
         Once an object has been copied back and forth N
         times, copy it to the old generation
         N is the “Tenuring Threshold” (tunable)
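
     For reference (a hedged aside, not from the original deck; exact behavior
     varies by JVM): the threshold and its diagnostics are controlled with
     flags like

         -XX:MaxTenuringThreshold=4 -XX:+PrintTenuringDistribution

     where the latter logs the age distribution of surviving objects at each
     young collection.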
The CMS collector in 60 seconds
A bit simplified, sorry...
            Several phases:
               1. initial-mark (stop-the-world) - marks roots (e.g.
                  thread stacks)
               2. concurrent-mark - traverse references starting at
                  roots, marking what’s live
               3. concurrent-preclean - another pass of the same
                  (catch new objects)
               4. remark (stop-the-world) - any last changed/new
                  objects
               5. concurrent-sweep - clean up dead objects to
                  update free space tracking
            Note: dead objects free up space, but it’s not
            contiguous. We’ll come back to this later!
CMS failure modes
   1. When young generation collection happens, it
      needs space in the old gen. What if CMS is
      already in the middle of concurrent work, but
      there’s no space?
          The dreaded concurrent mode failure! Stop
          the world and collect.
           Solution: lower the value of
           -XX:CMSInitiatingOccupancyFraction so
           CMS starts working earlier (example below)
   2. What if there’s space in the old generation, but
      not enough contiguous space to promote a
      large object?
          We need to compact the old generation (move all
          free space to be contiguous)
          This is also stop-the-world! Kaboom!
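
   For example (the value here is illustrative):

       -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly

   starts CMS once the old generation is about 60% occupied, and the second
   flag keeps the JVM from overriding that threshold with its own heuristics.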
OK... so life sucks.

What can we do about it?
Step 1. Hypothesize
     Setting the initiating occupancy fraction low
     puts off GC, but it eventually happens no
     matter what
     We see promotion failed followed by long
     GC pause, even when 30% of the heap is free.
     Why? Must be fragmentation!
Step 2. Measure
     Let’s make some graphs:
     -XX:PrintFLSStatistics=1
     -XX:PrintCMSStatistics=1
     -XX:+PrintGCDetails
     -XX:+PrintGCDateStamps -verbose:gc
     -Xloggc:/.../logs/gc-$(hostname).log
     FLS Statistics: verbose information about the
     state of the free space inside the old generation

          Free space - total amount of free space
          Num blocks - number of fragments it’s spread into
          Max chunk size - size of the largest contiguous free chunk
     parse-fls-statistics.py → R and
     ggplot2
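
     A minimal sketch of the parsing step (this is not the actual
     parse-fls-statistics.py; the FLS log line format and the heap-word units
     are assumptions and may vary by JVM version). It pulls the
     “Max Chunk Size” samples out of the GC log so they can be plotted:

         import java.io.BufferedReader;
         import java.io.FileReader;
         import java.util.regex.Matcher;
         import java.util.regex.Pattern;

         // Sketch: print "Max Chunk Size" samples from a GC log produced with
         // -XX:PrintFLSStatistics=1, one per line, for plotting in R/ggplot2.
         public class FlsMaxChunk {
           public static void main(String[] args) throws Exception {
             Pattern maxChunk = Pattern.compile("Max\\s+Chunk Size:\\s*(\\d+)");
             try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
               String line;
               long sample = 0;
               while ((line = in.readLine()) != null) {
                 Matcher m = maxChunk.matcher(line);
                 if (m.find()) {
                   // Assumption: the value is in heap words (8 bytes each on a
                   // 64-bit JVM); convert to bytes.
                   System.out.println(sample++ + "\t" + Long.parseLong(m.group(1)) * 8);
                 }
               }
             }
           }
         }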
3 YCSB workloads, graphed

Workload 1: Insert-only

Workload 2: Read-only with cache churn

Workload 3: Read-only with no cache churn
          So boring I didn’t make a graph!
          All allocations are short-lived → stay in the young gen
Recap
What have we learned?

             Fragmentation is what causes long GC pauses
             Write load seems to cause fragmentation
             Read load (LRU cache churn) isn’t nearly so bad¹

       ¹ At least for my test workloads
Taking a step back
Why does write load cause fragmentation?

          Imagine we have 5 regions, A through E
          We take writes in the following order into an
          empty old generation:
          ABCDEABCEDDAECBACEBCED
          Now B’s memstore fills up and flushes. We’re
          left with:
          A CDEA CEDDAEC ACE CED
          Looks like fragmentation!
Also known as swiss cheese

If every write is exactly the same size, it’s fine -
we’ll fill in those holes. But this is seldom true.
A solution
     The crucial issue is that memory allocations for a
     given memstore aren’t next to each other in
     the old generation.
     When we free an entire memstore we only get
     tiny blocks of free space
     What if we ensure that the memory for a
     memstore is made of large blocks?
     Enter the MemStore Local Allocation Buffer
     (MSLAB)
What’s an MSLAB?
    Each MemStore has an instance of
    MemStoreLAB.
    MemStoreLAB has a 2MB curChunk with
    nextFreeOffset starting at 0.
    Before inserting a KeyValue that points to
    some byte[], copy the data into curChunk
    and increment nextFreeOffset by data.length
    Insert a KeyValue pointing inside curChunk
    instead of the original data.
    If a chunk fills up, just make a new one.
    This is all lock-free, using atomic
    compare-and-swap instructions.
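
     A stripped-down sketch of that allocation path (simplified; this is not
     HBase’s actual MemStoreLAB class): one 2MB chunk, an atomic offset, and
     a compare-and-swap retry loop to claim space without locking.

         import java.util.concurrent.atomic.AtomicInteger;

         // Simplified MSLAB-style allocator: copy incoming cell data into a
         // shared 2MB chunk so a memstore's data sits in large contiguous blocks.
         class ChunkAllocator {
           private final byte[] curChunk = new byte[2 * 1024 * 1024];
           private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

           /** Copies data into the chunk; returns its offset, or -1 if full. */
           int copyInto(byte[] data) {
             while (true) {
               int oldOffset = nextFreeOffset.get();
               if (oldOffset + data.length > curChunk.length) {
                 return -1; // caller would start a fresh chunk and retry there
               }
               // Lock-free claim of [oldOffset, oldOffset + data.length) via CAS.
               if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + data.length)) {
                 System.arraycopy(data, 0, curChunk, oldOffset, data.length);
                 return oldOffset; // the KeyValue is rebuilt to point into curChunk here
               }
               // Lost the race to another writer: re-read the offset and retry.
             }
           }
         }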
How does this help?
     The original data to be inserted becomes very
     short-lived, and dies in the young generation.
     The only data in the old generation is made of
     2MB chunks
     Each chunk only belongs to one memstore.
     When we flush, we always free up 2MB chunks,
     and avoid the swiss cheese effect.
     Next time we allocate, we need exactly 2MB
     chunks again, and there will definitely be space.
Does it work?
It works!

     Have seen basically zero full GCs with MSLAB
     enabled, after days of load testing
Summary
    Most GC pauses are caused by fragmentation
    in the old generation.
    The CMS collector doesn’t compact, so the
    only way it can fight fragmentation is to pause.
    The MSLAB moves all MemStore allocations
    into contiguous 2MB chunks in the old
    generation.
    No more GC pauses!
How to try it
   1. Upgrade to HBase 0.90.1 (included in
      CDH3b4)
   2. Set hbase.hregion.memstore.mslab.enabled to
      true (see the example config after this list)
          Also tunable:
          hbase.hregion.memstore.mslab.chunksize
          (in bytes, default 2M)

          hbase.hregion.memstore.mslab.max.allocation
          (in bytes, default 256K)
   3. Report back your results!
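
   For example, in hbase-site.xml on each region server (the chunksize and
   max.allocation entries below just restate the defaults and can be omitted):

       <property>
         <name>hbase.hregion.memstore.mslab.enabled</name>
         <value>true</value>
       </property>
       <property>
         <name>hbase.hregion.memstore.mslab.chunksize</name>
         <value>2097152</value>   <!-- 2MB, the default -->
       </property>
       <property>
         <name>hbase.hregion.memstore.mslab.max.allocation</name>
         <value>262144</value>    <!-- 256KB, the default -->
       </property>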
Future work
     Flat 2MB chunk per region → 2GB RAM
     minimum usage for 1000 regions
     incrementColumnValue currently bypasses
     MSLAB for subtle reasons
     We’re doing an extra memory copy into
     MSLAB chunk - we can optimize this out
     Maybe we can relax
     CMSInitiatingOccupancyFraction back up
     a bit?
So I don’t forget...
Corporate shill time

     Cloudera is offering HBase training on March 10th.

     15 percent off with the hbase meetup code.
todd@cloudera.com
  Twitter: @tlipcon
#hbase IRC: tlipcon

   P.S. we’re hiring!
