Maintaining sanity on the Cloud: Key ELB metrics to watch for in a load spike

When experiencing a surge in inbound requests, we have to watch the ELB's CloudWatch metrics closely. The metrics are documented on the AWS documentation page MonitoringLoadBalancerWithCW.

The key metrics are:

RequestCount
Latency
HTTPCode_ELB_5XX
HTTPCode_Backend_5XX
SurgeQueueLength

Typically, you will see a linear relationship between the RequestCount and Latency metrics: as the load increases, the latency increases with it. With default settings, the ELB's 60-second timeout is triggered once an instance takes longer than that to respond.

The Latency metric measures how long the ELB waits for a response from the instance to which it handed the request. If the instances take longer to respond (whether with HTTP 200, 4xx or 5xx codes), latency increases.

The HTTPCode_ELB_5XX metric counts the requests that the ELB itself failed to handle, where the ELB sends an HTTP 5xx error code directly back to the client. HTTPCode_Backend_5XX counts the HTTP 5xx responses returned by the backend instances, for example when they fail to get a valid response from an end service. The SurgeQueueLength metric counts the requests that the ELB has queued while waiting for a healthy instance to become available.
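To pull these five metrics during a spike, something like the following sketch may help. It assumes Python with boto3 and a Classic ELB (CloudWatch namespace AWS/ELB, dimension LoadBalancerName). The helper names `build_queries` and `fetch_metrics`, the choice of statistic per metric, and the one-hour window are illustrative assumptions, not anything prescribed by AWS.

```python
from datetime import datetime, timedelta, timezone

# Assumed statistic per metric: Sum for counters, Average for Latency,
# Maximum for the queue length. Adjust to taste.
ELB_METRICS = {
    "RequestCount": "Sum",
    "Latency": "Average",
    "HTTPCode_ELB_5XX": "Sum",
    "HTTPCode_Backend_5XX": "Sum",
    "SurgeQueueLength": "Maximum",
}


def build_queries(load_balancer_name, minutes=60, period=60, now=None):
    """Build one get_metric_statistics argument dict per key metric."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return [
        {
            "Namespace": "AWS/ELB",
            "MetricName": name,
            "Dimensions": [
                {"Name": "LoadBalancerName", "Value": load_balancer_name}
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": period,  # seconds per datapoint
            "Statistics": [stat],
        }
        for name, stat in ELB_METRICS.items()
    ]


def fetch_metrics(load_balancer_name, **kwargs):
    """Query CloudWatch; needs boto3 and AWS credentials to be configured."""
    import boto3  # imported lazily so build_queries works without it

    cloudwatch = boto3.client("cloudwatch")
    results = {}
    for query in build_queries(load_balancer_name, **kwargs):
        response = cloudwatch.get_metric_statistics(**query)
        # Datapoints come back unordered, so sort them by time.
        results[query["MetricName"]] = sorted(
            response["Datapoints"], key=lambda d: d["Timestamp"]
        )
    return results
```

Calling `fetch_metrics("my-elb")` (where "my-elb" is a placeholder load balancer name) would return a dict keyed by metric name, each holding the time-ordered datapoints for the last hour.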

[Figure: CloudWatch graph of the RequestCount and Latency metrics showing a linear relationship]
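If you want to check how close to linear the RequestCount/Latency relationship is in your own data, a least-squares fit over paired datapoints is one simple option. This is only a sketch in plain Python; the function name `linear_fit` and the sample numbers in the usage note are made up for illustration.

```python
from statistics import mean


def linear_fit(request_counts, latencies):
    """Return (slope, intercept, pearson_r) for latency as a function of load.

    Both sequences must be aligned on the same CloudWatch periods.
    """
    if len(request_counts) != len(latencies) or len(request_counts) < 2:
        raise ValueError("need two aligned series with at least 2 points")

    mean_x, mean_y = mean(request_counts), mean(latencies)
    sxx = sum((x - mean_x) ** 2 for x in request_counts)
    syy = sum((y - mean_y) ** 2 for y in latencies)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(request_counts, latencies)
    )
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined fit")

    slope = sxy / sxx                  # extra seconds of latency per request
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, pearson_r
```

For example, `linear_fit([100, 200, 300, 400], [0.1, 0.2, 0.3, 0.4])` describes a perfectly linear series, so the correlation comes out at (very nearly) 1. A correlation well below 1 during a spike suggests something other than raw load is driving latency.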

NOTE - If all of your instances behind the ELB are in the same Availability Zone, then you may want to disable the "cross zone load balancing" feature of the ELB to reduce the performance overhead by a small amount.
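Cross-zone load balancing can be toggled from the console, or programmatically. Below is a minimal sketch using boto3's Classic ELB client (`elb`) and its `modify_load_balancer_attributes` call; the helper names `cross_zone_attributes` and `set_cross_zone` are hypothetical, and the call needs AWS credentials with permission to modify the load balancer.

```python
def cross_zone_attributes(enabled):
    """Build the LoadBalancerAttributes payload for the cross-zone setting."""
    return {"CrossZoneLoadBalancing": {"Enabled": bool(enabled)}}


def set_cross_zone(load_balancer_name, enabled):
    """Apply the setting to a Classic ELB; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the payload helper works without it

    elb = boto3.client("elb")
    return elb.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=cross_zone_attributes(enabled),
    )
```

With this, `set_cross_zone("my-elb", False)` would disable the feature on a load balancer named "my-elb" (a placeholder name).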

You can also enable access logs and access log collection, as described in the AWS documentation links below:

http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/enable-access-logs.html
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html
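As a rough illustration of what enabling access logs looks like outside the console, here is a sketch using boto3's Classic ELB client. The helper names, the bucket name and the prefix are placeholders of my own; the S3 bucket is assumed to exist already and to have a policy that lets ELB write to it, as the pages above describe.

```python
def access_log_attributes(bucket, prefix="", emit_interval=60):
    """Build the LoadBalancerAttributes payload that turns access logs on.

    emit_interval is in minutes; Classic ELB accepts 5 or 60.
    """
    if emit_interval not in (5, 60):
        raise ValueError("emit_interval must be 5 or 60 minutes")

    access_log = {
        "Enabled": True,
        "S3BucketName": bucket,
        "EmitInterval": emit_interval,
    }
    if prefix:
        access_log["S3BucketPrefix"] = prefix
    return {"AccessLog": access_log}


def enable_access_logs(load_balancer_name, bucket, **kwargs):
    """Apply the setting to a Classic ELB; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the payload helper works without it

    elb = boto3.client("elb")
    return elb.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=access_log_attributes(bucket, **kwargs),
    )
```

During a load spike, the 5-minute interval (`emit_interval=5`) gets log files into S3 sooner, at the cost of more, smaller objects.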

Tuesday, June 10, 2014