Deep Application Monitoring
using Statsd and more

Pete Fritchman <petef@mozilla.com>
Mozilla - Service Operations
PICC 2012
May 12, 2012
http://fetep.net/picc12 (these slides and some links)

More Application Visibility

Graph Everything (almost)

Understand what you are graphing and how it impacts the stack.

The list of metrics to collect and graph is not static; it should change and grow over time.

Monitor What Matters

Like the metrics we graph, the list of metrics we monitor should change over time.

How?

Lots of monitoring software options (some commercial, some open-source) -- choose what's right for you.


One approach using open source software...

Demo

What is a Metric?

testapp.mysql.errors.colo1.web1

Metric name: testapp.mysql.errors
Colo: colo1
Host: web1


Metric Update

testapp.mysql.errors.colo1.web1 1336436563 0.4

Metric: testapp.mysql.errors.colo1.web1
Time: 1336436563 (seconds since Epoch)
Value: 0.4
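To make the naming scheme concrete, here is a minimal Ruby sketch that splits an update line into its parts. It follows the field order shown on this slide (metric, time, value); Graphite's plaintext listener, for comparison, expects the value before the timestamp. `MetricUpdate` and `parse_metric_update` are hypothetical names used only for illustration.

```ruby
# Hypothetical value object for one metric update; field names are illustrative.
MetricUpdate = Struct.new(:name, :colo, :host, :time, :value)

# Parse a line in the slide's field order: "<metric> <epoch seconds> <value>".
def parse_metric_update(line)
  metric, time, value = line.split
  # The last two dotted components are the colo and the host;
  # everything before them is the metric name.
  *name_parts, colo, host = metric.split(".")
  MetricUpdate.new(name_parts.join("."), colo, host,
                   Time.at(Integer(time)).utc, Float(value))
end

update = parse_metric_update("testapp.mysql.errors.colo1.web1 1336436563 0.4")
puts update.name
puts update.colo
puts update.host
```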

Collect Metrics

Collecting Metrics -- System Level


Lots of options -- choose anything that can send a properly formatted metric update to graphite/AMQP.

Collecting Metrics -- Application Level

What is statsd?

graphite frontend that does aggregation

run one statsd per host

one UDP packet to localhost per stat update

pick a statsd implementation that can output to AMQP
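A rough sketch of what "one UDP packet per stat update" looks like from the client side, using only Ruby's standard library. The `name:value|type` text format is the usual statsd wire format; the helper names, the host and port 8125 are assumptions, so adjust them for your statsd.

```ruby
require "socket"

# Format one statsd update: "<name>:<value>|<type>",
# where type is "c" for a counter or "ms" for a timer.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Send one update as a single UDP datagram.
# Host and port are assumptions (8125 is a commonly used statsd port).
def send_stat(name, value, type, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(statsd_packet(name, value, type), 0, host, port)
ensure
  socket.close if socket
end

send_stat("testapp.mysql.errors", 1, "c")   # increment a counter by 1
send_stat("testapp.mysql.query", 12, "ms")  # record a 12 ms timing
```

UDP is fire-and-forget: if no statsd is listening, the update is simply lost and the application carries on.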


Scraping Logfiles

For apps whose source we do not control, the easiest way to instrument them is sometimes to scrape their log files.


Use logstash to parse logs (e.g. an HTTP access log), extract useful fields (e.g. response code, request time) and send the data to statsd.

Sample Logstash Config

Input -- where are the files?

input {
  file {
    path => "/var/log/httpd/testapp.example.com.access_*[0-9]"
    type => "testapp_lb"
  }
}
        

Sample Logstash Config

Filter -- parse the data

10.0.0.198 testapp.example.com [10/May/2012:07:42:39 -0700]
"GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618
req_s:0.222838 retries:0
        
filter {
  grok {
    type => "testapp_lb"
    pattern => "%{IP:ip} %{HOST:host} \[%{HTTPDATE:ts}\]
\"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}\"
%{NUMBER:response} %{NUMBER:bytes} node_s:%{NUMBER:node_time}
req_s:%{NUMBER:req_time} retries:%{NUMBER:retries}"
    named_captures_only => true
  }
}
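For readers unfamiliar with grok, the pattern above corresponds roughly to the plain Ruby regular expression below. This is an approximation, not what logstash runs internally: the character classes are looser than grok's built-in patterns, and `ACCESS_LINE` and `parse_access_line` are hypothetical names. It assumes the log entry is a single line (it is wrapped on the slide only for display).

```ruby
# Plain-Ruby approximation of the grok pattern;
# capture names mirror the grok field names.
ACCESS_LINE = %r{
  \A(?<ip>\S+)\s(?<host>\S+)\s\[(?<ts>[^\]]+)\]\s
  "(?<verb>\w+)\s(?<request>\S+)\sHTTP/(?<httpversion>[\d.]+)"\s
  (?<response>\d+)\s(?<bytes>\d+)\s
  node_s:(?<node_time>[\d.]+)\s
  req_s:(?<req_time>[\d.]+)\s
  retries:(?<retries>\d+)\z
}x

# Return a hash of field name => string value, or nil if the line does not match.
def parse_access_line(line)
  m = ACCESS_LINE.match(line.strip)
  m && m.names.zip(m.captures).to_h
end

line = '10.0.0.198 testapp.example.com [10/May/2012:07:42:39 -0700] ' \
       '"GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618 ' \
       'req_s:0.222838 retries:0'
fields = parse_access_line(line)
puts fields["response"]
puts fields["req_time"]
```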
        

Sample Logstash Config

Output -- statsd counters and timers

output {
  statsd {
    type => "testapp_lb"
    increment => [ "testapp_lb.hits",
                   "testapp_lb.response_code.%{response}"
                 ]
    timing    => [ "testapp_lb.request_time", "%{req_time}",
                   "testapp_lb.response_bytes", "%{bytes}",
                   "testapp_lb.retries", "%{retries}"
                 ]
  }
}
        

Instrumenting an Application

For more fine-grained stats, integrate a statsd client into the application and directly update counters and timers.

Sample App Code

def get_foo(id)
  foo = @memcache.get(id)
  if !foo
    # cache miss
    foo = @db.query("select foo from bar where id = ?", id)
  end
  return foo
end
        

Sample App Code (instrumented)

def get_foo(id)
  @statsd.increment("controller.get_foo")
  foo = @memcache.get(id)
  if !foo
    # cache miss: count it, then time the database query
    @statsd.increment("cache.get_foo.miss")
    # block form of timing; some clients name this method differently (e.g. `time`)
    @statsd.timing("db.query_time") {
      foo = @db.query("select foo from bar where id = ?", id)
    }
  else
    @statsd.increment("cache.get_foo.hit")
  end
  return foo
end
        

Use Metrics

Sample Pencil Config

  testapp_mysql_query_time:
    title: "testapp mysql query time"
    targets:
      stats.timers.testapp.mysql.query.mean:
        :key: mean
        :color: green
      stats.timers.testapp.mysql.query.upper_90:
        :key: upper_90
        :color: blue
    vtitle: ms
    hosts: ["web*"]
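The `mean` and `upper_90` series graphed above are computed by statsd from the raw timer samples in each flush interval. A minimal sketch of that aggregation follows, assuming the common statsd behavior where `upper_90` is the largest sample left after discarding the slowest 10%. Your statsd implementation may round differently, and `summarize_timer` is a hypothetical helper.

```ruby
# Summarize one flush interval of timer samples (milliseconds).
# upper_<pct> is the largest sample left after dropping the slowest
# (100 - pct)% of samples; an assumption about your statsd's behavior.
def summarize_timer(samples, percentile = 90)
  return nil if samples.empty?

  sorted = samples.sort
  drop = ((100 - percentile) / 100.0 * sorted.size).round
  kept = sorted.first(sorted.size - drop)
  {
    "count" => sorted.size,
    "mean" => sorted.sum.to_f / sorted.size,
    "upper_#{percentile}" => kept.last,
  }
end

stats = summarize_timer([12, 15, 11, 240, 14, 13, 16, 12, 18, 17])
puts stats["mean"]
puts stats["upper_90"]
```

Graphing both series together shows why the percentile is useful: a single slow query pulls the mean up, while `upper_90` ignores the outlier.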
        

Sample cepmon Config

threshold('testapp_500', 'stats.logstash.testapp_lb.hit.500',
          :operator => '>',
          :threshold => 0.5,
          :units => '500s/sec',
          :average_over => "1 min"
         )
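The rule above reads as "alert when the rate of 500s, averaged over one minute, exceeds 0.5 per second". The sketch below shows that logic in plain Ruby. It does not use cepmon, and `AverageThreshold` is a hypothetical class written only to show the windowed-average check.

```ruby
# Hypothetical sliding-window threshold check; cepmon itself is not used here.
class AverageThreshold
  def initialize(threshold, window_seconds)
    @threshold = threshold
    @window = window_seconds
    @points = [] # [epoch_seconds, value] pairs inside the window
  end

  # Record a data point and report whether the average over the
  # window exceeds the threshold.
  def update(time, value)
    @points << [time, value]
    @points.reject! { |t, _| t <= time - @window }
    average = @points.sum { |_, v| v } / @points.size.to_f
    average > @threshold
  end
end

check = AverageThreshold.new(0.5, 60)
puts check.update(0, 0.2)   # one quiet sample
puts check.update(10, 0.4)
puts check.update(20, 1.5)  # a burst of 500s raises the average
```

Averaging over a window keeps a single noisy data point from paging anyone; only a sustained rate crosses the threshold.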
        

Conclusion


Questions?


These slides, along with more links and resources: http://fetep.net/picc12/