Apache BookKeeper Observability — Part 1 — Introducing the Metrics

Jack Vanlightly
Published in Splunk-MaaS
Oct 21, 2021

Based on BookKeeper 4.14 as configured for Apache Pulsar.

We recently published a series about BookKeeper internals with the objective of helping people build a mental model of how BookKeeper works underneath the APIs. This mental model so far has focused on the read and write paths.

This series builds on that and explains all the read and write metrics currently available (version 4.14) and how to interpret them.

In our post “The Importance of Mental Models for Incident Response” we discussed that, after running BookKeeper in production for a number of years, we believe the USE method is the most effective approach for fast diagnosis of performance issues. In this series we’ll frame a lot of our metrics explanations and recommendations in terms of Utilization and Saturation.

This series starts off with an introduction to the types of metrics that BookKeeper uses and introduces some high level metrics. In subsequent posts we’ll cover how to diagnose specific bottlenecks in the journal and DbLedgerStorage.

Remember that our series focuses on BookKeeper as configured for use in Apache Pulsar clusters and so some pluggable components are not included.

There are of course your system metrics (CPU, memory, disk utilization/saturation) which are out-of-scope for this series, but can be vitally important for diagnosing performance issues.

Objectives of this series

Everyone’s environment is different: different hardware, different load patterns, some on premises, some in the cloud, etc. The examples in this series are simply examples. The main objective of this series is to arm you with knowledge about the metrics that exist and the kinds of things you can do with them. In the end this is a complex topic; you are an expert user who applies expert knowledge to solve problems. This series will hopefully make you more effective at troubleshooting using the available BookKeeper metrics.

Quick Guide to Metric Types in BookKeeper

BookKeeper emits metrics from different scopes and sub-scopes. A scope relates to the component the metric is emitted from and metric names are prefixed with these scope names. For example:

  • bookkeeper_server: emitted from the Netty server layer.
  • bookie: emitted from the bookie internals.
  • journal: related to the journal specifically.
  • ledger: related to ledger storage.
  • index: related to ledger indexes.
  • entrylogger: related to the writing of entries to entry log files.
  • gc: related to garbage collection (the removal of deleted ledgers).

The use of the sub-scopes is not consistent so don’t expect all metrics to include a sub-scope.

An example of a metric name is: bookie_journal_JOURNAL_QUEUE_SIZE which has a scope, sub-scope and metric.

BookKeeper employs three types of metric.

Counters, specifically monotonic counters (i.e. they don’t get decremented), are used for metrics such as write and read rates. An example is bookie_rejected_write_requests, which is a counter for when the write thread cannot add an entry to the write cache because it is full. These counters are typically used for calculating rates.
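For example, a rejected write rate per bookie could be derived from that counter with a Prometheus query along the following lines (a sketch, assuming the Prometheus metrics provider and a pod label, as used in the examples later in this post):

Rejected write request rate: sum(irate(bookie_rejected_write_requests[1m])) by (pod)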

Gauges, used for metrics such as cache and queue sizes. An example is bookie_journal_JOURNAL_QUEUE_SIZE which is the length of the in-memory queue of entries to be written to the current journal file. Note that some gauges are implemented as counters with inc() and dec() operations (i.e. not monotonic counters).

Gauges are basically samples and so can miss short-lived spikes. This can be confusing sometimes when a rate metric (a counter) spikes but a corresponding gauge that you’d expect to spike as well does not. A good example is an in-memory queue that quickly becomes full, then empty again before the next sample is taken.

The OpStatsLogger metrics class offers two APIs for measuring either values or latencies. Each OpStatsLogger instance emits 6 metrics:

  • a histogram of all successful operation values (either latency or value)
  • a counter of all successful operations, with the name suffix _count
  • a sum of all successful operation values (either latency or value), with the name suffix _sum
  • the same three again, but for failed operations

The histograms come with the following percentile labels:

  • quantile="0.5"
  • quantile="0.75"
  • quantile="0.999"
  • quantile="0.9999"
  • quantile="1.0"

An example is the bookkeeper_server_READ_ENTRY_REQUEST OpStatsLogger which emits the following metrics:

  • bookkeeper_server_READ_ENTRY_REQUEST (histogram of read request latencies)
  • bookkeeper_server_READ_ENTRY_REQUEST_count (counter of requests)
  • bookkeeper_server_READ_ENTRY_REQUEST_sum (sum of all request latencies)

Each of which can be further sliced and diced by success (true/false) and quantile labels.

So we can calculate rates from either the counter metrics or an OpStatsLogger, which also includes a counter. We can also calculate things like the total time spent on a given operation using the sum metric, which is very useful, as we’ll cover in subsequent posts.
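For example, the sum and count metrics can be combined to approximate an average latency over a window, with a query along these lines (a sketch, assuming the Prometheus metrics provider and a pod label):

Average write request latency: sum(rate(bookkeeper_server_ADD_ENTRY_REQUEST_sum[1m])) by (pod) / sum(rate(bookkeeper_server_ADD_ENTRY_REQUEST_count[1m])) by (pod)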

Detecting that there is a performance problem

The most common way of knowing whether a BookKeeper cluster or an individual bookie has a performance issue is via either the Pulsar topic metrics or the high level bookie read and write metrics.

An easy problem to detect is that the read or write rate four hours ago was X and for the last hour it has been a much lower value, Y. But not all performance issues are that clear. Sometimes it simply comes down to your cluster being under-provisioned for the load and not reaching the throughput you want.

Under-provisioning

Under-provisioning basically means that you haven’t sized your cluster adequately for the current load being placed on it. For example, you sized your cluster based on load X six months ago, but now you have load Y and the cluster simply isn’t provisioned to handle it.

Or maybe when you did your sizing analysis you based it on everything operating normally: Pulsar consumers were tailing the topics and being served from Pulsar broker caches. Now, six months later, your consumers have had a couple of hours of downtime; they’ve resumed consumption and are working as hard as they can to drain the backlog. None of the messages are in the broker caches, so those messages are being read from your bookies at a very high rate, and the bookies were only provisioned for a “happy path” write-heavy workload.

In these situations overload indicators such as utilization and saturation can help, as they can confirm whether your bookies, or one specific bookie, are the bottleneck.

High level metrics

Fig 1: The three main high level metric areas

The main high level read and write metrics can be classified into three areas:

  1. The request rate and latencies. The request latencies measure the elapsed time from the moment a request is read from a network buffer to the moment the response is written to a network buffer.
  2. The time spent queued in each thread pool thread (read or write thread pool) and also the size of each thread pool thread task queue.
  3. The operation rate and latencies. These latencies measure the elapsed time from the moment the read or write operation actually starts being executed to the moment it completes.

The request rate and latencies (1) tell us the current request demand and how long it is taking to fulfil those requests. The request latency metrics should be treated with care, as a heavily overloaded bookie will be rejecting most requests immediately, meaning that the p50 or even p99 latency may be very low.

The thread pool queue metrics (2) are good indicators of saturation. When the task queues start filling it means the bookie is not keeping up with the request rate and when the queues are full the bookie is saturated for that type of request (read or write).

Finally, the operation metrics (3) show us the rate and latencies of actually performing the read and write operations. Seeing large request latencies but low operation latencies should correlate with large thread pool queue sizes and latencies. Also, seeing higher request rates than operation rates means that requests are being rejected, most likely due to full thread pool task queues.
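As a quick sketch, the gap between request rate and operation rate can be plotted directly; for writes, a persistent positive difference suggests rejections (the metric names are listed in the next sections; the pod label is an assumption):

Write requests not executed: sum(irate(bookkeeper_server_ADD_ENTRY_REQUEST_count[1m])) by (pod) - sum(irate(bookkeeper_server_ADD_ENTRY_count[1m])) by (pod)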

Request metrics (1)

  1. bookkeeper_server_ADD_ENTRY_REQUEST (OpStatsLogger): Latency histogram of the complete write request time, including the time spent queued waiting to be processed. Using the success label, the success and failure rates can also be tracked separately if desired.
  2. bookkeeper_server_ADD_ENTRY_REQUEST_count: The write request rate can be measured via this counter.
  3. bookkeeper_server_READ_ENTRY_REQUEST (OpStatsLogger): Latency histogram of the complete read request time, including the time spent queued waiting to be processed.
  4. bookkeeper_server_READ_ENTRY_REQUEST_count: The read request rate can be measured via this counter.
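For example, a high percentile write request latency per bookie could be charted with something like the following (using one of the quantile labels listed earlier; the pod label is an assumption):

p99.9 write request latency: max(bookkeeper_server_ADD_ENTRY_REQUEST{success="true", quantile="0.999"}) by (pod)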

Thread pool thread task queue metrics (2)

There are three thread pool metrics we really care about:

  • queue length — this is reported per thread in the pool, with a suffix in the metric name for the thread ordinal. For example, for the read thread pool with its default of 8 threads, the queue length metric is repeated 8 times, each with a different suffix (0–7).
  • time each request spent queued — this is aggregated across all threads of the pool.
  • time each request spent being executed (rather than queued) — this is aggregated across all threads of the pool.

The following examples use the _0 suffix for the task queue length metrics.

  1. bookkeeper_server_BookieWriteThreadPool_queue_0 (gauge): The current size of the write thread task queue. The default queue length limit per thread is 10000 and can be configured with maxPendingAddRequestsPerThread. Full task queues indicate that the bookie write path is saturated. Netty threads will likely be rejecting many write requests due to there being no space on the write thread task queues. Unfortunately there is no specific metric for rejected requests (only when rejected by DbLedgerStorage); rejected requests are recorded as failed requests (bookkeeper_server_ADD_ENTRY_REQUEST).
  2. bookkeeper_server_BookieWriteThreadPool_task_queued (OpStatsLogger): Latency histogram for how long write requests are queued before being executed. Significant queue times indicate the write path is saturated.
  3. bookkeeper_server_BookieWriteThreadPool_task_execution (OpStatsLogger): Latency histogram for how long write requests spend being executed. Using the bookkeeper_server_BookieWriteThreadPool_task_execution_sum counter is useful for calculating utilization of the write threads, which is described in the USE Write Metrics post.
  4. bookkeeper_server_BookieReadThreadPool_queue_0 (gauge): Same as for the write thread pool. The queue size limit is configured with maxPendingReadRequestsPerThread. When the queue size reaches the limit the read path is fully saturated and read requests are getting rejected at the Netty layer.
  5. bookkeeper_server_BookieReadThreadPool_task_queued (OpStatsLogger): Latency histogram for how long read requests are queued before being executed. Significant queue times indicate the read path is saturated.
  6. bookkeeper_server_BookieReadThreadPool_task_execution (OpStatsLogger): Latency histogram for how long read requests spend being executed. Using the bookkeeper_server_BookieReadThreadPool_task_execution_sum counter is useful for calculating utilization of the read threads, which has an example below and is described in the USE Read Metrics post.

Note that because there is currently no metric for requests rejected due to full thread pool queues, full task queues are the next best indicator. You can of course plot the failed write requests, and if you don’t see errors in your bookie logs then they are most likely rejected reads/writes.

NOTE! Placing the thread ordinal in the metric name is a bit old school, as these metrics go back a long time. It can make aggregation of the task queue lengths over an entire thread pool difficult. In a future BookKeeper release it is planned to move the thread ordinal suffixes into a label.
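Until then, one workaround is to match the suffixed metric names with a regular expression on __name__ in Prometheus, something along these lines (a sketch; the pod label is an assumption):

Total write thread pool queue length: sum({__name__=~"bookkeeper_server_BookieWriteThreadPool_queue_.+"}) by (pod)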

Operation Metrics (3)

  1. bookkeeper_server_ADD_ENTRY (OpStatsLogger): Latency histogram of the time spent actually performing writes. Does not include time spent queued. Using the success label, the success and failure rates can also be tracked separately if desired.
  2. bookkeeper_server_ADD_ENTRY_count. The write operation rate can be measured via this counter.
  3. bookie_WRITE_BYTES (counter): The number of bytes written.
  4. bookkeeper_server_READ_ENTRY (OpStatsLogger): Latency histogram of the time spent actually performing reads. Does not include time spent queued.
  5. bookkeeper_server_READ_ENTRY_count. The read operation rate can be measured via this counter.
  6. bookie_READ_BYTES (counter): The number of bytes read.
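For example, the write-side counterparts of the read query examples in the next section could look like this (a sketch, assuming the Prometheus metrics provider and a pod label):

Write operation rate: sum(irate(bookkeeper_server_ADD_ENTRY_count[1m])) by (pod)

Write throughput: sum(irate(bookie_WRITE_BYTES[1m])) by (pod)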

Some Examples (with Prometheus queries)

Read request and operation rates

Fig 2. High level read metrics show load increasing before reaching a tipping point where reads start failing (being rejected in this case). No cause, just the effect.

Read request rate (success and fail): sum(irate(bookkeeper_server_READ_ENTRY_REQUEST_count[1m])) by (pod)

Failed read request rate: sum(irate(bookkeeper_server_READ_ENTRY_REQUEST_count{success="false"}[1m])) by (pod)

Read operation rate: sum(irate(bookkeeper_server_READ_ENTRY_count[1m])) by (pod)

Read throughput: sum(irate(bookie_READ_BYTES[1m])) by (pod)

Read Utilization

In the below Grafana panel we see load increase until the read thread pool is spending all its time executing requests. We’ll cover utilization more in the read and write USE metrics posts.

Fig 3. The time the read thread pool spends executing read requests. With 8 read threads the maximum time spent per second is 8.

Read threads time spent per second: sum(irate(bookkeeper_server_BookieReadThreadPool_task_execution_sum[1m])) by (pod)
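To turn that into a 0–1 utilization figure, the same expression can be divided by the number of read threads (assuming the default of 8), for example:

Read thread pool utilization: sum(irate(bookkeeper_server_BookieReadThreadPool_task_execution_sum[1m])) by (pod) / 8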

Signs of Write Saturation

In the below Grafana panel we see two clear signs of write saturation: failed write requests and large write thread queues, with corresponding queue time latency increases.

Fig 4. As the write path becomes saturated, the failed write requests go up with corresponding increases in write thread pool task queue length and task queue latency.

Failed write request rate: sum(irate(bookkeeper_server_ADD_ENTRY_REQUEST_count{success="false"}[1m])) by (pod)

Write Thread 0 Task Queue Length: sum(bookkeeper_server_BookieWriteThreadPool_queue_0) by (pod)

Write Thread Pool Queue Time: sum(bookkeeper_server_BookieWriteThreadPool_task_queued{success="true", quantile="0.99"}) by (pod)

We’ll be covering signs of saturation in more detail in subsequent posts.

Other High Level Metrics

There are some other high level metrics that may be interesting, depending on your configuration.

  1. bookkeeper_server_ADD_ENTRY_IN_PROGRESS (gauge): The total number of write requests in progress, including those queued waiting to be executed. In and of itself it is not particularly useful, but if the maxAddsInProgressLimit configuration is being used then this becomes an important metric. See the post on back pressure to learn more about this config.
  2. bookkeeper_server_ADD_ENTRY_BLOCKED (gauge): By default there is no limit to the number of in progress write requests, but if configured (maxAddsInProgressLimit) then when this limit is reached the affected Netty thread blocks until the number of in-progress writes dips below the limit again. Each time a Netty thread blocks, the gauge value increments and on being unblocked it decrements.
  3. bookkeeper_server_ADD_ENTRY_BLOCKED_WAIT (OpStatsLogger): Latency histogram for the time Netty threads spent blocked due to reaching the maxAddsInProgressLimit limit.
  4. bookkeeper_server_READ_ENTRY_IN_PROGRESS (gauge): See write counterpart.
  5. bookkeeper_server_READ_ENTRY_BLOCKED (gauge): See write counterpart.
  6. bookkeeper_server_READ_ENTRY_BLOCKED_WAIT (OpStatsLogger): See write counterpart.
  7. bookkeeper_server_CHANNEL_WRITE (OpStatsLogger): Latency histogram for time spent sending responses over Netty channels. Because it’s an OpStatsLogger it records successful responses and failed responses separately. Failed responses can be due to timeouts when waitTimeoutOnResponseBackpressureMs is configured or other errors such as connections being closed or out of memory errors. So this can be a useful metric for diagnosing when the network between the bookie and the clients is the problem.
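For example, blocked Netty threads and slow channel writes could be watched with queries along these lines (a sketch; the pod label is an assumption):

Blocked Netty threads (writes): sum(bookkeeper_server_ADD_ENTRY_BLOCKED) by (pod)

Channel write p99.9 latency: max(bookkeeper_server_CHANNEL_WRITE{success="true", quantile="0.999"}) by (pod)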

Summary

These top-level metrics can show you that you have a problem, i.e. that the read and/or write rate has dropped, latencies are higher, or that there are clear signs of saturation in the read and/or write path.

One of the most useful metrics is not currently available: rejected requests due to full thread pool task queues. They are reported as request failures. So it is highly recommended to track request failures; high failure rates indicate either that some kind of error is occurring (bad) or that requests are being rejected due to saturation (also bad). There will be a specific metric for this in the near future.

If we see there is a problem, what next?

Next we apply our knowledge of BookKeeper internals from our BookKeeper internals series, combined with the metrics described in the next few posts, where we cover read and write bottlenecks in the journal and DbLedgerStorage components.

Series links:
