Avoiding Full GCs with
MemStore-Local Allocation Buffers
                 Todd Lipcon
              todd@cloudera.com
Twitter: @tlipcon      #hbase IRC: tlipcon




            February 22, 2011
Outline

  Background

  HBase and GC

  A solution

  Summary
Intro / who am I?
     Been working on data stuff for a few years
     HBase, HDFS, MR committer
     Cloudera engineer since March ’09
Motivation
     HBase users want to use large heaps
         Bigger block caches make for better hit rates
         Bigger memstores make for larger and more
         efficient flushes
         Machines come with 24G-48G RAM
     But bigger heaps mean longer GC pauses
         Around 10 seconds/GB on my boxes.
         Several minute GC pauses wreak havoc
GC Disasters
   1. Client requests stalled
           1 minute “latency” is just as bad as unavailability
   2. ZooKeeper sessions stop pinging
           The dreaded “Juliet Pause” scenario
   3. Triggers all kinds of other nasty bugs
Yo Concurrent Mark-and-Sweep (CMS)!

What part of Concurrent didn’t you understand?
Java GC Background
     Java’s GC is generational
         Generational hypothesis: most objects either die
         young or stick around for quite a long time
         Split the heap into two “generations” - young (aka
         new) and old (aka tenured)
     Use different algorithms for the two generations
     We usually recommend -XX:+UseParNewGC
     -XX:+UseConcMarkSweepGC (example below)
         Young generation: Parallel New collector
         Old generation: Concurrent-mark-sweep
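
     For example (a sketch - heap sizing is omitted here; it is usually set
     separately via HBASE_HEAPSIZE), these collectors are typically enabled
     through HBASE_OPTS in conf/hbase-env.sh:

         export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"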
The Parallel New collector in 60 seconds
     Divide the young generation into eden,
     survivor-0, and survivor-1
     One survivor space is from-space and the other
     is to-space
     Allocate all objects in eden
     When eden fills up, stop the world and copy
     live objects from eden and from-space into
     to-space, swap from and to
         Once an object has been copied back and forth N
         times, copy it to the old generation
         N is the “Tenuring Threshold” (tunable)
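
     For reference (a hedged aside, not from the original deck; exact behavior
     varies by JVM): the threshold and its diagnostics are controlled with
     flags like

         -XX:MaxTenuringThreshold=4 -XX:+PrintTenuringDistribution

     where the latter logs the age distribution of surviving objects at each
     young collection.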
The CMS collector in 60 seconds
A bit simplified, sorry...
            Several phases:
               1. initial-mark (stop-the-world) - marks roots (e.g.
                  thread stacks)
               2. concurrent-mark - traverse references starting at
                  roots, marking what’s live
               3. concurrent-preclean - another pass of the same
                  (catch new objects)
               4. remark (stop-the-world) - any last changed/new
                  objects
               5. concurrent-sweep - clean up dead objects to
                  update free space tracking
            Note: dead objects free up space, but it’s not
            contiguous. We’ll come back to this later!
CMS failure modes
   1. When young generation collection happens, it
      needs space in the old gen. What if CMS is
      already in the middle of concurrent work, but
      there’s no space?
          The dreaded concurrent mode failure! Stop
          the world and collect.
           Solution: lower the value of
           -XX:CMSInitiatingOccupancyFraction so
           CMS starts working earlier (example below)
   2. What if there’s space in the old generation, but
      not enough contiguous space to promote a
      large object?
          We need to compact the old generation (move all
          free space to be contiguous)
          This is also stop-the-world! Kaboom!
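
   For example (the value here is illustrative):

       -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly

   starts CMS once the old generation is about 60% occupied, and the second
   flag keeps the JVM from overriding that threshold with its own heuristics.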
OK... so life sucks.

What can we do about it?
Step 1. Hypothesize
     Setting the initiating occupancy fraction low
     puts off GC, but it eventually happens no
     matter what
     We see promotion failed followed by long
     GC pause, even when 30% of the heap is free.
     Why? Must be fragmentation!
Step 2. Measure
     Let’s make some graphs:
     -XX:PrintFLSStatistics=1
     -XX:PrintCMSStatistics=1
     -XX:+PrintGCDetails
     -XX:+PrintGCDateStamps -verbose:gc
     -Xloggc:/.../logs/gc-$(hostname).log
     FLS Statistics: verbose information about the
     state of the free space inside the old generation

          Free space - total amount of free space
          Num blocks - number of fragments it’s spread into
          Max chunk size - size of the largest contiguous free chunk
     parse-fls-statistics.py → R and
     ggplot2
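
     A minimal sketch of the parsing step (this is not the actual
     parse-fls-statistics.py; the FLS log line format and the heap-word units
     are assumptions and may vary by JVM version). It pulls the
     “Max Chunk Size” samples out of the GC log so they can be plotted:

         import java.io.BufferedReader;
         import java.io.FileReader;
         import java.util.regex.Matcher;
         import java.util.regex.Pattern;

         // Sketch: print "Max Chunk Size" samples from a GC log produced with
         // -XX:PrintFLSStatistics=1, one per line, for plotting in R/ggplot2.
         public class FlsMaxChunk {
           public static void main(String[] args) throws Exception {
             Pattern maxChunk = Pattern.compile("Max\\s+Chunk Size:\\s*(\\d+)");
             try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
               String line;
               long sample = 0;
               while ((line = in.readLine()) != null) {
                 Matcher m = maxChunk.matcher(line);
                 if (m.find()) {
                   // Assumption: the value is in heap words (8 bytes each on a
                   // 64-bit JVM); convert to bytes.
                   System.out.println(sample++ + "\t" + Long.parseLong(m.group(1)) * 8);
                 }
               }
             }
           }
         }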
3 YCSB workloads, graphed

Workload 1: Insert-only

Workload 2: Read-only with cache churn

Workload 3: Read-only with no cache churn
          So boring I didn’t make a graph!
          All allocations are short-lived → stay in the young gen
Recap
What have we learned?

             Fragmentation is what causes long GC pauses
             Write load seems to cause fragmentation
             Read load (LRU cache churn) isn’t nearly so bad¹

       ¹ At least for my test workloads
Taking a step back
Why does write load cause fragmentation?

          Imagine we have 5 regions, A through E
          We take writes in the following order into an
          empty old generation:
          ABCDEABCEDDAECBACEBCED
          Now B’s memstore fills up and flushes. We’re
          left with:
          A CDEA CEDDAEC ACE CED
          Looks like fragmentation!
Also known as swiss cheese

If every write is exactly the same size, it’s fine -
we’ll fill in those holes. But this is seldom true.
A solution
     The crucial issue is that memory allocations for a
     given memstore aren’t next to each other in
     the old generation.
     When we free an entire memstore we only get
     tiny blocks of free space
     What if we ensure that the memory for a
     memstore is made of large blocks?
     Enter the MemStore Local Allocation Buffer
     (MSLAB)
What’s an MSLAB?
    Each MemStore has an instance of
    MemStoreLAB.
    MemStoreLAB has a 2MB curChunk with
    nextFreeOffset starting at 0.
    Before inserting a KeyValue that points to
    some byte[], copy the data into curChunk
    and increment nextFreeOffset by data.length
    Insert a KeyValue pointing inside curChunk
    instead of the original data.
    If a chunk fills up, just make a new one.
    This is all lock-free, using atomic
    compare-and-swap instructions.
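
     A stripped-down sketch of that allocation path (simplified; this is not
     HBase’s actual MemStoreLAB class): one 2MB chunk, an atomic offset, and
     a compare-and-swap retry loop to claim space without locking.

         import java.util.concurrent.atomic.AtomicInteger;

         // Simplified MSLAB-style allocator: copy incoming cell data into a
         // shared 2MB chunk so a memstore's data sits in large contiguous blocks.
         class ChunkAllocator {
           private final byte[] curChunk = new byte[2 * 1024 * 1024];
           private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

           /** Copies data into the chunk; returns its offset, or -1 if full. */
           int copyInto(byte[] data) {
             while (true) {
               int oldOffset = nextFreeOffset.get();
               if (oldOffset + data.length > curChunk.length) {
                 return -1; // caller would start a fresh chunk and retry there
               }
               // Lock-free claim of [oldOffset, oldOffset + data.length) via CAS.
               if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + data.length)) {
                 System.arraycopy(data, 0, curChunk, oldOffset, data.length);
                 return oldOffset; // the KeyValue is rebuilt to point into curChunk here
               }
               // Lost the race to another writer: re-read the offset and retry.
             }
           }
         }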
How does this help?
     The original data to be inserted becomes very
     short-lived, and dies in the young generation.
     The only data in the old generation is made of
     2MB chunks
     Each chunk only belongs to one memstore.
     When we flush, we always free up 2MB chunks,
     and avoid the swiss cheese effect.
     Next time we allocate, we need exactly 2MB
     chunks again, and there will definitely be space.
Does it work?
It works!

     Have seen basically zero full GCs with MSLAB
     enabled, after days of load testing
Summary
    Most GC pauses are caused by fragmentation
    in the old generation.
    The CMS collector doesn’t compact, so the
    only way it can fight fragmentation is to pause.
    The MSLAB moves all MemStore allocations
    into contiguous 2MB chunks in the old
    generation.
    No more GC pauses!
How to try it
   1. Upgrade to HBase 0.90.1 (included in
      CDH3b4)
   2. Set hbase.hregion.memstore.mslab.enabled to
      true (see the example config after this list)
          Also tunable:
          hbase.hregion.memstore.mslab.chunksize
          (in bytes, default 2M)

          hbase.hregion.memstore.mslab.max.allocation
          (in bytes, default 256K)
   3. Report back your results!
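
   For example, in hbase-site.xml on each region server (the chunksize and
   max.allocation entries below just restate the defaults and can be omitted):

       <property>
         <name>hbase.hregion.memstore.mslab.enabled</name>
         <value>true</value>
       </property>
       <property>
         <name>hbase.hregion.memstore.mslab.chunksize</name>
         <value>2097152</value>   <!-- 2MB, the default -->
       </property>
       <property>
         <name>hbase.hregion.memstore.mslab.max.allocation</name>
         <value>262144</value>    <!-- 256KB, the default -->
       </property>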
Future work
     Flat 2MB chunk per region → 2GB RAM
     minimum usage for 1000 regions
     incrementColumnValue currently bypasses
     MSLAB for subtle reasons
     We’re doing an extra memory copy into
     MSLAB chunk - we can optimize this out
     Maybe we can relax
     CMSInitiatingOccupancyFraction back up
     a bit?
So I don’t forget...
Corporate shill time

     Cloudera is offering HBase training on March 10th.

     15 percent off with the hbase meetup code.
todd@cloudera.com
  Twitter: @tlipcon
#hbase IRC: tlipcon

   P.S. we’re hiring!
