Troubleshooting AWS ELB 5xx Errors: 504 Gateway Timeout

We have a service that occasionally received a 504 Gateway Timeout response from its ELB (specifically, a Classic Load Balancer, or CLB).

When investigating the root cause, we found that the CLB's 3 instances had been healthy for the past two weeks, so the 504 timeouts were generated by the CLB itself.

The following access log entries show a CLB 504 error, followed by a successful request for comparison:

2022-xx-xx:xx:xx.xx servicename xx.xx.xx.xx:60450 - -1 -1 -1 504 0 157 0 "POST http://serviceurl:80/api  HTTP/1.1" "GuzzleHttp/x.5.5 curl/x.xx.0 xxx-1ubuntux.xx" - -

2022-xx-xx:xx:xx.xx servicename xx.xx.xx.xx:60450 xx.xx.xx.xx:80 0.000033 0.091067 0.000041 200 200 157 227 "POST http://serviceurl:80  HTTP/1.1" "GuzzleHttp/x.5.5 curl/x.xx.0 xxx-1ubuntux.xx" - -
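To make the two entries easier to compare, here is a minimal Python sketch that splits a CLB access log line into named fields. The field order follows the CLB access log format as I understand it from the AWS documentation; the helper names (`parse_clb_entry`, `was_dispatched`) are my own, and `shlex` is just a convenient way to keep the quoted request and user-agent fields whole, so a user agent containing stray quotes would need more careful handling.

```python
import shlex

# Field order of a Classic Load Balancer access log entry (assumed from the
# AWS documentation; verify against your own logs).
FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]


def parse_clb_entry(line: str) -> dict:
    """Split one access log line into named fields; quoted fields stay whole."""
    values = shlex.split(line)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def was_dispatched(entry: dict) -> bool:
    """Return False when the CLB logged no backend ("-") for the request."""
    return entry["backend"] != "-"


# The two (redacted) entries from this post.
SAMPLE_504 = (
    '2022-xx-xx:xx:xx.xx servicename xx.xx.xx.xx:60450 - -1 -1 -1 504 0 157 0 '
    '"POST http://serviceurl:80/api  HTTP/1.1" '
    '"GuzzleHttp/x.5.5 curl/x.xx.0 xxx-1ubuntux.xx" - -'
)
SAMPLE_200 = (
    '2022-xx-xx:xx:xx.xx servicename xx.xx.xx.xx:60450 xx.xx.xx.xx:80 '
    '0.000033 0.091067 0.000041 200 200 157 227 '
    '"POST http://serviceurl:80  HTTP/1.1" '
    '"GuzzleHttp/x.5.5 curl/x.xx.0 xxx-1ubuntux.xx" - -'
)

if __name__ == "__main__":
    for line in (SAMPLE_504, SAMPLE_200):
        entry = parse_clb_entry(line)
        print(entry["elb_status_code"], entry["backend"], was_dispatched(entry))
```

In the 504 entry the backend field is `-` and all three processing times are `-1`, which suggests the CLB never managed to hand the request to an instance, whereas the 200 entry shows a backend address and real timings.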

Why does a 504 Gateway Timeout occur in ELB?

There are two possible reasons for a CLB to return a 504 error while its targets are healthy:

a. The CLB's idle timeout expired before the target sent its response.

b. The target closed the connection before the CLB reached its idle timeout.

See [1] and [2] below for troubleshooting CLB 504 errors.

We checked the CLB metrics and observed:

- No request count spike at the CLB.
- No CPU utilization spike at the targets.
- Latency [3] was low (<1 second).

So, from the current information, this does not seem to be a case of the CLB reaching its idle timeout.

Since the successful request spent only 0.091067 s in backend_processing_time, I do not think the 504 was caused by the first scenario (a), in which the backend fails to send a response before the CLB's idle timeout elapses.

Hence, we should check the second scenario (b): the backend web server should have keep-alive enabled, with a keep-alive timeout greater than the CLB's idle timeout (60 seconds by default), so that the backend does not close the connection before the CLB does.
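The rule above boils down to a single comparison. As a rough illustration (the function name is hypothetical, and both values are assumed to be in seconds):

```python
def backend_outlives_clb(keep_alive_timeout_s: float, idle_timeout_s: float) -> bool:
    """True when the backend keeps idle connections open longer than the CLB.

    The comparison is strict: with equal timeouts the backend may still close
    a connection at the moment the CLB tries to reuse it.
    """
    return keep_alive_timeout_s > idle_timeout_s


if __name__ == "__main__":
    print(backend_outlives_clb(60, 60))  # equal timeouts: not safe
    print(backend_outlives_clb(60, 50))  # backend outlives the CLB: safe
```

With a 60-second keep-alive timeout against the default 60-second idle timeout the check fails, which is why one of the two values has to move.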

The keep-alive configuration for NGINX and Apache is described in [4] and [5].

Finally, since our Apache keep-alive timeout is 60 s and it must be greater than the CLB's idle timeout, we lowered the CLB's idle timeout from 60 s to 50 s, which should resolve the issue.
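For reference, the idle timeout can also be changed programmatically. The sketch below assumes boto3 is installed and AWS credentials are configured; the load balancer name `my-clb` is a placeholder, and the helper names are my own. It uses the `modify_load_balancer_attributes` call of the Classic Load Balancer (`elb`) client, and is meant as a starting point rather than a tested deployment script.

```python
def idle_timeout_attributes(seconds: int) -> dict:
    """Build the LoadBalancerAttributes payload that sets the CLB idle timeout."""
    if seconds <= 0:
        raise ValueError("idle timeout must be a positive number of seconds")
    return {"ConnectionSettings": {"IdleTimeout": seconds}}


def set_idle_timeout(load_balancer_name: str, seconds: int) -> None:
    """Apply the idle timeout to a Classic Load Balancer via boto3."""
    import boto3  # imported lazily so the payload helper works without boto3

    client = boto3.client("elb")  # "elb" is the Classic Load Balancer API
    client.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=idle_timeout_attributes(seconds),
    )


if __name__ == "__main__":
    # Placeholder name; replace with your own CLB before running.
    set_idle_timeout("my-clb", 50)
```

Keeping the payload construction separate from the API call makes it possible to check the request shape without touching AWS.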

[1] Troubleshoot a Classic Load Balancer: HTTP errors - HTTP 504: Gateway timeout https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/ts-elb-error-message.html#ts-elb-errorcodes-http504

[2] How do I troubleshoot 504 errors returned while using a Classic Load Balancer? https://aws.amazon.com/premiumsupport/knowledge-center/504-error-classic/?nc1=h_ls

[3] CloudWatch metrics for your Classic Load Balancer https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-cloudwatch-metrics.html

[4] NGINX - Module ngx_http_upstream_module - keepalive_timeout http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout

[5] Apache - Apache Core Features - KeepAliveTimeout Directive https://httpd.apache.org/docs/2.4/en/mod/core.html#keepalivetimeout