Deep Application Monitoring
using Statsd and more

Pete Fritchman <>
Mozilla - Service Operations
PICC 2012
May 12, 2012 (these slides and some links)

More Application Visibility

Graph Everything (almost)

Understand what you are graphing and how it impacts the stack.

The list of metrics to collect and graph is not static; it should change and grow over time.

Monitor What Matters

Like the metrics we graph, the list of metrics we monitor should change over time.


Lots of monitoring software options (some commercial, some open-source) -- choose what's right for you.



One approach using open source software...


What is a Metric?


Metric name: testapp.mysql.errors
Colo: colo1
Host: web1


Metric Update

testapp.mysql.errors.colo1.web1 0.4 1336436563

Metric: testapp.mysql.errors.colo1.web1
Value: 0.4
Time: 1336436563 (seconds since Epoch)
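
A minimal sketch of building such an update in Ruby, assuming the target is Graphite's plaintext protocol, which orders the fields as path, value, timestamp. The helper names `metric_path` and `metric_update` are hypothetical and exist only for illustration.

```ruby
# Join the metric name, colo and host into one dotted path.
def metric_path(name, colo, host)
  [name, colo, host].join(".")
end

# Format one update line: "<path> <value> <seconds since Epoch>".
# Field order is an assumption based on Graphite's plaintext protocol.
def metric_update(name, colo, host, value, time = Time.now)
  "#{metric_path(name, colo, host)} #{value} #{time.to_i}"
end

line = metric_update("testapp.mysql.errors", "colo1", "web1", 0.4, Time.at(1336436563))
puts line
```

Putting colo and host at the end of the path keeps all hosts for one metric under a common prefix, which makes wildcard queries such as `testapp.mysql.errors.colo1.*` straightforward.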

Collect Metrics

Collecting Metrics -- System Level


Lots of options -- choose anything that can send a properly formatted metric update to graphite/AMQP.

Collecting Metrics -- Application Level

What is statsd?

graphite frontend that does aggregation

run one statsd per host

one UDP packet to localhost per stat update

pick a statsd implementation that can output to AMQP
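
A rough sketch of what "one UDP packet per stat update" looks like on the wire, assuming the usual statsd datagram format `name:value|type` and the default port 8125. The helper names `statsd_packet` and `send_stat` and the metric names are illustrative, not part of any particular client library.

```ruby
require "socket"

# Format one statsd datagram: "<name>:<value>|<type>".
# "c" is a counter increment, "ms" is a timer sample.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget send to the local statsd (port 8125 assumed).
def send_stat(socket, packet, host = "127.0.0.1", port = 8125)
  socket.send(packet, 0, host, port)
rescue SystemCallError
  # A failed stats update should never break the application.
  nil
end

socket  = UDPSocket.new
counter = statsd_packet("testapp.mysql.errors", 1, "c")
timer   = statsd_packet("testapp.db.query_time", 12, "ms")
send_stat(socket, counter)
send_stat(socket, timer)
socket.close
```

Because UDP is connectionless, the application pays almost nothing per update and keeps running even if no statsd is listening.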



Scraping Logfiles

For apps whose source we do not control, the easiest way to instrument them is sometimes to scrape their log files.


Use logstash to parse logs (e.g. http access log), extract useful fields (e.g. response code, request time, etc.) and send data to statsd
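
To make the parse-then-emit idea concrete, here is a sketch in plain Ruby rather than logstash. It assumes an access log line shaped like the one in this deck; the leading IP address and hostname in the sample line are made up, and `LINE_RE` and `statsd_updates` are hypothetical names.

```ruby
# Regex equivalent of the fields we care about in the access log line.
LINE_RE = %r{
  \A(?<ip>\S+)\s(?<host>\S+)\s
  \[(?<ts>[^\]]+)\]\s
  "(?<verb>\w+)\s(?<request>\S+)\sHTTP/(?<httpversion>[\d.]+)"\s
  (?<response>\d+)\s(?<bytes>\d+)\s
  node_s:(?<node_time>[\d.]+)\s
  req_s:(?<req_time>[\d.]+)\s
  retries:(?<retries>\d+)
}x

# Turn one log line into statsd datagrams; unparseable lines yield nothing.
# Values are passed through unchanged, with no unit conversion.
def statsd_updates(line)
  m = LINE_RE.match(line)
  return [] unless m
  [
    "testapp_lb.hits:1|c",
    "testapp_lb.request_time:#{m[:req_time]}|ms",
    "testapp_lb.response_bytes:#{m[:bytes]}|ms",
    "testapp_lb.retries:#{m[:retries]}|ms",
  ]
end

# Sample line; the IP and hostname are invented for illustration.
sample = '192.0.2.10 lb1.example.com [10/May/2012:07:42:39 -0700] ' \
         '"GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618 req_s:0.222838 retries:0'
updates = statsd_updates(sample)
puts updates
```

Logstash does the same job declaratively, and adds file tailing and rotation handling that this sketch leaves out.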

Sample Logstash Config

Input -- where are the files?

input {
  file {
    path => "/var/log/httpd/*[0-9]"
    type => "testapp_lb"
  }
}

Sample Logstash Config

Filter -- parse the data

[10/May/2012:07:42:39 -0700] "GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618 req_s:0.222838 retries:0

filter {
  grok {
    type => "testapp_lb"
    pattern => "%{IP:ip} %{HOST:host} \[%{HTTPDATE:ts}\] \"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response} %{NUMBER:bytes} node_s:%{NUMBER:node_time} req_s:%{NUMBER:req_time} retries:%{NUMBER:retries}"
    named_captures_only => true
  }
}

Sample Logstash Config

Output -- statsd counters and timers

output {
  statsd {
    type => "testapp_lb"
    increment => [ "testapp_lb.hits" ]
    timing    => [ "testapp_lb.request_time", "%{req_time}",
                   "testapp_lb.response_bytes", "%{bytes}",
                   "testapp_lb.retries", "%{retries}" ]
  }
}

Instrumenting an Application

For more fine-grained stats, integrate a statsd client into the application and directly update counters and timers.

Sample App Code

def get_foo(id)
  foo = @memcache.get(id)
  if !foo
    # cache miss
    foo = @db.query("select foo from bar where id = ?", id)
  end
  return foo
end

Sample App Code (instrumented)

def get_foo(id)
  foo = @memcache.get(id)
  if !foo
    # cache miss: time the database query
    @statsd.timing("db.query_time") {
      foo = @db.query("select foo from bar where id = ?", id)
    }
  end
  return foo
end
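
For readers wondering what a block-taking `timing` call might do internally, here is a minimal sketch. `TinyStatsd` is a hypothetical stand-in, not a real client library; it collects datagrams in memory where a real client would send each one over UDP.

```ruby
# Hypothetical stand-in for a statsd client with a block-style timer.
class TinyStatsd
  attr_reader :sent

  def initialize
    @sent = [] # datagrams collected here instead of being sent over UDP
  end

  # Run the block, record its wall-clock duration in ms, and return
  # whatever the block returned. The timer is recorded even if the
  # block raises.
  def timing(name)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    @sent << "#{name}:#{(elapsed * 1000).round}|ms"
  end
end

statsd = TinyStatsd.new
foo = statsd.timing("db.query_time") { sleep 0.01; "row" }
puts statsd.sent
```

Returning the block's value means the timer can wrap an existing expression without restructuring the surrounding code.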

Use Metrics

Sample Pencil Config

    title: "testapp mysql query time"
        :key: mean
        :color: green
        :key: upper_90
        :color: blue
    vtitle: ms
    hosts: ["web*"]

Sample cepmon Config

threshold('testapp_500', 'stats.logstash.testapp_lb.hit.500',
          :operator => '>',
          :threshold => 0.5,
          :units => '500s/sec',
          :average_over => "1 min")
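
The logic behind a threshold like this can be sketched in a few lines of Ruby. This is an illustration of averaging a rate over a window and comparing it to a limit, not how cepmon is implemented; `threshold_breached?` and the sample data are hypothetical.

```ruby
# samples: array of [time_in_seconds, rate_per_second] pairs.
# Averages the rates that fall inside the window ending at `now`
# and compares the average to the threshold with the given operator.
def threshold_breached?(samples, now:, operator:, threshold:, average_over:)
  recent = samples.select { |t, _| t > now - average_over && t <= now }
  return false if recent.empty?
  average = recent.sum { |_, rate| rate } / recent.size.to_f
  average.public_send(operator, threshold)
end

# One sample every 10 seconds; rates are 500s/sec.
noisy = [[10, 0.0], [20, 0.2], [30, 0.4], [40, 0.9], [50, 1.1], [60, 1.0]]
quiet = noisy.map { |t, _| [t, 0.1] }

breached = threshold_breached?(noisy, now: 60, operator: :>, threshold: 0.5, average_over: 60)
calm     = threshold_breached?(quiet, now: 60, operator: :>, threshold: 0.5, average_over: 60)
puts breached, calm
```

Averaging over a window keeps a single bad sample from triggering an alert while still catching a sustained error rate.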





These slides, along with more links and resources: