Deep Application Monitoring
using Statsd and more
Pete Fritchman <[email protected]>
Mozilla - Service Operations
PICC 2012
May 12, 2012
http://fetep.net/picc12 (these slides and some links)
More Application Visibility
- System metrics and monitoring are not enough; we must know about the application, too
- Near real-time problem alerting
- Faster outage recovery time
- Better estimates for capacity planning
- Make informed decisions when purchasing, scaling, and re-engineering
- Everyone can see the same data and draw conclusions
Graph Everything (almost)
Understand what you are graphing and how it impacts the stack.
- Good: database query latency
- Good: rate of HTTP response codes (HTTP 200s per second, etc.)
- Good: MySQL InnoDB buffer pool dirty percentage
- Bad: every number you can find in /proc
- Bad: numbers that don't change (total amount of memory)
The list of metrics to collect and graph is not static; it should
change and grow over time.
Monitor What Matters
- Is the load average below 10?
- Is port 80 open?
- Are there 8 or more processes named httpd?
- Does the word "Exception" appear in a new log line?
- Synthetic transactions
- Are customers seeing HTTP 5xx response codes? At what rate?
- Are we meeting our response time SLA?
- Early warning signs of problems
Like the metrics we graph, the list of metrics we monitor should
change over time.
How?
Lots of monitoring software options (some commercial, some
open-source) -- choose what's right for you.
One approach using open source software...
- graphite
- statsd
- logstash
- pencil
- cepmon
Demo
What is a Metric?
testapp.mysql.errors.colo1.web1
Metric name: testapp.mysql.errors
Colo: colo1
Host: web1
Metric Update
testapp.mysql.errors.colo1.web1 1336436563 0.4
Metric: testapp.mysql.errors.colo1.web1
Time: 1336436563 (seconds since Epoch)
Value: 0.4
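The update above is graphite's plaintext protocol: one line per update, `metric value timestamp`. A minimal Ruby sketch of sending such an update to carbon (assuming carbon's plaintext listener on its default TCP port 2003; the host is a placeholder):

```ruby
require 'socket'

# Build a graphite plaintext protocol line: "<metric> <value> <timestamp>\n"
def graphite_line(metric, value, ts = Time.now.to_i)
  "#{metric} #{value} #{ts}\n"
end

# Send one metric update to carbon's plaintext listener.
def send_metric(metric, value, host: "localhost", port: 2003)
  TCPSocket.open(host, port) { |s| s.write(graphite_line(metric, value)) }
end

# send_metric("testapp.mysql.errors.colo1.web1", 0.4)
```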
Collect Metrics
- Collection infrastructure
  - RabbitMQ
  - graphite (carbon + whisper)
- Collect system metrics
- Collect application metrics
Collecting Metrics -- System Level
- gmond, unicast to localhost
- graphlia listener: converts ganglia metrics to graphite names and calculates rates
- send stats to AMQP "stats" exchange
Lots of options -- choose anything that can send a properly
formatted metric update to graphite/AMQP.
Collecting Metrics -- Application Level
- poll application (SNMP, JMX, etc.), send to AMQP directly
- scrape log files, send to statsd
- instrument application, send directly to statsd
What is statsd?
graphite frontend that does aggregation
run one statsd per host
one UDP packet to localhost per stat update
pick a statsd implementation that can output to AMQP
- counters: rate (counter change per second)
- timers: lower, upper, upper 90th percentile, mean
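On the wire, each statsd update is a single UDP datagram such as `name:1|c` (counter) or `name:223|ms` (timer). A small sketch, assuming the common statsd default of localhost port 8125; the metric names are illustrative:

```ruby
require 'socket'

# statsd wire format: "<name>:<value>|c" for counters, "<name>:<value>|ms" for timers.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

sock = UDPSocket.new
# One UDP packet to the local statsd per stat update (8125 is the usual default port).
# sock.send(statsd_packet("testapp_lb.hits", 1, "c"), 0, "localhost", 8125)
# sock.send(statsd_packet("testapp_lb.request_time", 223, "ms"), 0, "localhost", 8125)
```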
Scraping Logfiles
For apps whose source we do not control, sometimes the easiest
way to instrument them is by scraping their log files.
Use logstash to parse logs (e.g. an HTTP access log), extract useful
fields (e.g. response code, request time), and send the data to
statsd.
Sample Logstash Config
Input -- where are the files?
input {
file {
path => "/var/log/httpd/testapp.example.com.access_*[0-9]"
type => "testapp_lb"
}
}
Sample Logstash Config
Filter -- parse the data
10.0.0.198 testapp.example.com [10/May/2012:07:42:39 -0700]
"GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618
req_s:0.222838 retries:0
filter {
grok {
type => "testapp_lb"
pattern => "%{IP:ip} %{HOST:host} \[%{HTTPDATE:ts}\]
\"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}\"
%{NUMBER:response} %{NUMBER:bytes} node_s:%{NUMBER:node_time}
req_s:%{NUMBER:req_time} retries:%{NUMBER:retries}"
named_captures_only => true
}
}
Sample Logstash Config
Output -- statsd counters and timers
output {
statsd {
type => "testapp_lb"
increment => [ "testapp_lb.hits",
"testapp_lb.response_code.%{response}",
]
timing => [ "testapp_lb.request_time", "%{req_time}",
"testapp_lb.response_bytes", "%{bytes}",
"testapp_lb.retries", "%{retries}"
]
}
}
Instrumenting an Application
For more fine-grained stats, integrate a statsd client into
the application and directly update counters and timers.
Sample App Code
def get_foo(id)
foo = @memcache.get(id)
if !foo
# cache miss
foo = @db.query("select foo from bar where id = ?", id)
end
return foo
end
Sample App Code (instrumented)
def get_foo(id)
@statsd.increment("controller.get_foo")
foo = @memcache.get(id)
if !foo
# cache miss
@statsd.increment("cache.get_foo.miss")
@statsd.timing("db.query_time") {
foo = @db.query("select foo from bar where id = ?", id)
}
else
@statsd.increment("cache.get_foo.hit")
end
return foo
end
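The `@statsd` client used above can be sketched in a few lines. This is a hypothetical minimal client, not a specific library, assuming the usual statsd UDP defaults; real clients add sampling, namespacing, and error handling:

```ruby
require 'socket'

# Minimal statsd client sketch: increment sends a counter packet,
# timing times a block and sends a timer packet in milliseconds.
class MiniStatsd
  def initialize(host = "localhost", port = 8125)
    @host, @port, @sock = host, port, UDPSocket.new
  end

  # Counter: "<name>:1|c"
  def increment(name)
    send_packet("#{name}:1|c")
  end

  # Timer: run the block, report elapsed time as "<name>:<ms>|ms",
  # and return the block's result.
  def timing(name)
    start = Time.now
    result = yield
    ms = ((Time.now - start) * 1000).round
    send_packet("#{name}:#{ms}|ms")
    result
  end

  private

  def send_packet(data)
    @sock.send(data, 0, @host, @port)
  end
end
```

Because `timing` returns the block's value, it can wrap the database call without changing the surrounding code.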
Use Metrics
- Graph dashboard (display metrics, real-time)
- Alerting (analyze metrics, real-time)
Sample Pencil Config
testapp_mysql_query_time:
title: "testapp mysql query time"
targets:
stats.timers.testapp.mysql.query.mean:
:key: mean
:color: green
stats.timers.testapp.mysql.query.upper_90:
:key: upper_90
:color: blue
vtitle: ms
hosts: ["web*"]
Sample cepmon Config
threshold('testapp_500', 'stats.logstash.testapp_lb.hit.500',
:operator => '>',
:threshold => 0.5,
:units => '500s/sec',
:average_over => "1 min"
)
Conclusion
- Monitor real customer transactions against your SLA
- Collect relevant metrics for graphing and alerting
- Use graphs for problem resolution, planning, and system status
- Generate alerts from the same data that drives your graphs
Questions?
These slides, along with more links and resources:
http://fetep.net/picc12/