Deep Application Monitoring
using Statsd and more
Pete Fritchman <[email protected]>
Mozilla - Service Operations
PICC 2012
May 12, 2012
http://fetep.net/picc12 (these slides and some links)
More Application Visibility
- System metrics and monitoring are not enough; we must know about the application, too
- Near real-time problem alerting
- Faster outage recovery time
- Better estimates for capacity planning
- Make informed decisions when purchasing, scaling, and re-engineering
- Everyone can see the same data and draw conclusions
Graph Everything (almost)
Understand what you are graphing and how it impacts the stack.
- Good: database query latency
- Good: rate of HTTP response codes (HTTP 200s per second, etc.)
- Good: MySQL InnoDB buffer pool dirty percentage
- Bad: every number you can find in /proc
- Bad: numbers that don't change (total amount of memory)
The list of metrics to collect and graph is not static; it should
change and grow over time.
Monitor What Matters
- Is the load average below 10?
- Is port 80 open?
- Are there 8 or more processes named httpd?
- Does the word "Exception" appear in a new log line?
- Synthetic transactions
- Are customers seeing HTTP 5xx response codes? At what rate?
- Are we meeting our response time SLA?
- Early warning signs of problems
Like the metrics we graph, the list of metrics we monitor should
change over time.
How?
Lots of monitoring software options (some commercial, some
open-source) -- choose what's right for you.
One approach using open source software...
- graphite
- statsd
- logstash
- pencil
- cepmon
Demo
What is a Metric?
testapp.mysql.errors.colo1.web1
Metric name: testapp.mysql.errors
Colo: colo1
Host: web1
Metric Update
testapp.mysql.errors.colo1.web1 1336436563 0.4
Metric: testapp.mysql.errors.colo1.web1
Time: 1336436563 (seconds since Epoch)
Value: 0.4
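The update above is graphite's plaintext protocol: one line per update, `metric value timestamp`. A minimal Ruby sketch of sending such an update to carbon (assuming carbon's plaintext listener on its default TCP port 2003; the host is a placeholder):

```ruby
require 'socket'

# Build a graphite plaintext protocol line: "<metric> <value> <timestamp>\n"
def graphite_line(metric, value, ts = Time.now.to_i)
  "#{metric} #{value} #{ts}\n"
end

# Send one metric update to carbon's plaintext listener.
def send_metric(metric, value, host: "localhost", port: 2003)
  TCPSocket.open(host, port) { |s| s.write(graphite_line(metric, value)) }
end

# send_metric("testapp.mysql.errors.colo1.web1", 0.4)
```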
Collect Metrics
- Collection infrastructure
  - RabbitMQ
  - graphite (carbon + whisper)
- Collect system metrics
- Collect application metrics
Collecting Metrics -- System Level
- gmond, unicast to localhost
- graphlia listener: converts ganglia metrics to graphite names and calculates rates
- send stats to AMQP "stats" exchange
Lots of options -- choose anything that can send a properly
formatted metric update to graphite/AMQP.
Collecting Metrics -- Application Level
- poll application (SNMP, JMX, etc.), send to AMQP directly
- scrape log files, send to statsd
- instrument application, send directly to statsd
What is statsd?
graphite frontend that does aggregation
run one statsd per host
one UDP packet to localhost per stat update
pick a statsd implementation that can output to AMQP
- counters: rate (counter change per second)
- timers: lower, upper, upper 90th percentile, mean
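On the wire, each statsd update is a single UDP datagram such as `name:1|c` (counter) or `name:223|ms` (timer). A small sketch, assuming the common statsd default of localhost port 8125; the metric names are illustrative:

```ruby
require 'socket'

# statsd wire format: "<name>:<value>|c" for counters, "<name>:<value>|ms" for timers.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

sock = UDPSocket.new
# One UDP packet to the local statsd per stat update (8125 is the usual default port).
# sock.send(statsd_packet("testapp_lb.hits", 1, "c"), 0, "localhost", 8125)
# sock.send(statsd_packet("testapp_lb.request_time", 223, "ms"), 0, "localhost", 8125)
```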
Scraping Logfiles
For apps whose source we do not control, sometimes the easiest
way to instrument them is by scraping their log files.
Use logstash to parse logs (e.g. an HTTP access log), extract useful
fields (e.g. response code, request time), and send the data to
statsd.
Sample Logstash Config
Input -- where are the files?
input {
file {
path => "/var/log/httpd/testapp.example.com.access_*[0-9]"
type => "testapp_lb"
}
}
Sample Logstash Config
Filter -- parse the data
10.0.0.198 testapp.example.com [10/May/2012:07:42:39 -0700]
"GET /foo/123 HTTP/1.1" 200 458 node_s:0.016618
req_s:0.222838 retries:0
filter {
grok {
type => "testapp_lb"
pattern => "%{IP:ip} %{HOST:host} \[%{HTTPDATE:ts}\]
\"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}\"
%{NUMBER:response} %{NUMBER:bytes} node_s:%{NUMBER:node_time}
req_s:%{NUMBER:req_time} retries:%{NUMBER:retries}"
named_captures_only => true
}
}
Sample Logstash Config
Output -- statsd counters and timers
output {
statsd {
type => "testapp_lb"
increment => [ "testapp_lb.hits",
"testapp_lb.response_code.%{response}",
]
timing => [ "testapp_lb.request_time", "%{req_time}",
"testapp_lb.response_bytes", "%{bytes}",
"testapp_lb.retries", "%{retries}"
]
}
}
Instrumenting an Application
For more fine-grained stats, integrate a statsd client into
the application and directly update counters and timers.
Sample App Code
def get_foo(id)
foo = @memcache.get(id)
if !foo
# cache miss
foo = @db.query("select foo from bar where id = ?", id)
end
return foo
end
Sample App Code (instrumented)
def get_foo(id)
@statsd.increment("controller.get_foo")
foo = @memcache.get(id)
if !foo
# cache miss
@statsd.increment("cache.get_foo.miss")
@statsd.timing("db.query_time") {
foo = @db.query("select foo from bar where id = ?", id)
}
else
@statsd.increment("cache.get_foo.hit")
end
return foo
end
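The `@statsd` client used above can be sketched in a few lines. This is a hypothetical minimal client, not a specific library, assuming the usual statsd UDP defaults; real clients add sampling, namespacing, and error handling:

```ruby
require 'socket'

# Minimal statsd client sketch: increment sends a counter packet,
# timing times a block and sends a timer packet in milliseconds.
class MiniStatsd
  def initialize(host = "localhost", port = 8125)
    @host, @port, @sock = host, port, UDPSocket.new
  end

  # Counter: "<name>:1|c"
  def increment(name)
    send_packet("#{name}:1|c")
  end

  # Timer: run the block, report elapsed time as "<name>:<ms>|ms",
  # and return the block's result.
  def timing(name)
    start = Time.now
    result = yield
    ms = ((Time.now - start) * 1000).round
    send_packet("#{name}:#{ms}|ms")
    result
  end

  private

  def send_packet(data)
    @sock.send(data, 0, @host, @port)
  end
end
```

Because `timing` returns the block's value, it can wrap the database call without changing the surrounding code.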
Use Metrics
- Graph dashboard (display metrics, real-time)
- Alerting (analyze metrics, real-time)
Sample Pencil Config
testapp_mysql_query_time:
title: "testapp mysql query time"
targets:
stats.timers.testapp.mysql.query.mean:
:key: mean
:color: green
stats.timers.testapp.mysql.query.upper_90:
:key: upper_90
:color: blue
vtitle: ms
hosts: ["web*"]
Sample cepmon Config
threshold('testapp_500', 'stats.logstash.testapp_lb.hit.500',
:operator => '>',
:threshold => 0.5,
:units => '500s/sec',
:average_over => "1 min"
)
Conclusion
- Monitor real customer transactions against your SLA
- Collect relevant metrics for graphing and alerting
- Use graphs for problem resolution, planning, and system status
- Generate alerts from the same data that drives your graphs
Questions?
These slides, along with more links and resources:
http://fetep.net/picc12/