
    What is a 504 Gateway Timeout — meaning, causes, and how to debug it


    The 504 Gateway Timeout meaning is precise, and worth getting right before reaching for any fix: a server acting as a gateway or proxy reached the upstream it was supposed to talk to, but the upstream did not finish responding within the gateway's configured time limit. The gateway gave up waiting and returned 504. The upstream might still be working on the request after the 504 has already been delivered to the client.

    A 504 is not "the internet is broken" and not "the site is down". It is a very specific signal — something further down the chain is slow — and the fix depends entirely on figuring out which "something" and why. This guide explains exactly what 504 means in HTTP terms, the handful of places it actually originates in a real production stack, and how to debug each without falling into the classic trap of raising timeouts to make the symptom go away.


    What 504 actually means (in HTTP terms)

    The HTTP spec defines 504 as: "the server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request."

    The two words that matter:

    • Gateway / proxy — whoever returned the 504 was acting as a middleman, not as the origin. Nginx in front of PHP-FPM, an API gateway in front of microservices, Cloudflare in front of your origin, a load balancer in front of an app cluster.
    • Timely — there is a configured timeout. The upstream did not exceed any HTTP-level limit; it exceeded the gateway's patience.

    That has three immediate consequences:

    1. The 504 was returned by the gateway, not the application. Reading the application's log and looking for "what error did the app throw" is a dead end — there is no error; the app is probably still running the request.
    2. There is a specific timeout setting somewhere that decided how long was too long. Find that setting, and you have found the gateway.
    3. The fix is almost never "raise the timeout". The fix is "make the upstream respond faster, or accept that this request is async".

    How 504 differs from 500, 502, 503

    These four 5xx statuses are routinely confused, and the wrong fix wastes hours:

    Status | What the gateway saw | Most common cause
    500 Internal Server Error | Upstream returned a response — and that response was an error | Application threw an unhandled exception
    502 Bad Gateway | Upstream returned an invalid response, or refused the connection | Upstream process is down or unreachable
    503 Service Unavailable | The server itself is intentionally refusing requests | Maintenance mode; rate limit; explicit return 503
    504 Gateway Timeout | Upstream connection succeeded; upstream just never replied in time | Upstream is slow (long DB query, deadlock, saturated worker pool)

    If you are seeing 500 or 502 instead of 504, the diagnosis paths are different — see How to fix the 500 Internal Server Error in Nginx for the 500 path; for 502, the upstream process is the place to look (is it running? is it listening on the expected socket/port?).


    What users see vs what operators see

    A 504 is one of the more user-hostile error pages because the page itself is usually generic — "504 Gateway Time-out" — with no context. Most users assume the entire site is broken, when in reality only one slow endpoint timed out. The operator-side reality is usually one of these:

    • A single slow endpoint is timing out while everything else is fine.
    • A database query that used to be fast got slow (missing index, table grew, statistics stale).
    • A worker pool is saturated — there are no free PHP-FPM children / Node workers / Puma threads to take the request, so it queues until the gateway gives up.
    • An external dependency (a third-party API, a paid integration) became slow and the timeout is propagating.

    Distinguishing these takes about ten minutes if you read the right logs in the right order, which is the rest of this guide.


    Step 1 — Identify the gateway that returned the 504

    The most useful thing you can do in the first minute is figure out who sent the 504. The HTTP response headers tell you.

    curl -sv https://example.com/the-failing-path 2>&1 | head -40
    

    Look at the Server header and any vendor-specific headers in the response:

    Response header / pattern | Likely gateway
    Server: nginx/1.x.x | Your own Nginx returned the 504
    Server: cloudflare plus a cf-ray header | Cloudflare returned it (note: Cloudflare also has its own 524 — see below)
    Server: awselb or Server: CloudFront, x-amz-* headers | AWS infrastructure (ALB, CloudFront)
    Server: Google Frontend / via: 1.1 google | Google Cloud Load Balancer / GFE
    Generic 504 page with no Server header | Often a CDN or WAF in front; check DNS to identify

    dig +short example.com and dig +short example.com NS also tell you whether the domain is fronted by a CDN. If the A record resolves to a Cloudflare, Fastly, or CloudFront IP, or the nameservers belong to a CDN provider, that layer is your most likely 504 source.
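
    If the full -v output is noisy, a narrower probe that prints only the status line and the identifying headers works just as well (the header list is illustrative; extend it for your own stack):

    curl -s -o /dev/null -D - https://example.com/the-failing-path \
      | grep -iE '^(HTTP/|server:|via:|cf-ray:|x-amz-|x-served-by:)'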

    Cloudflare's 524 vs a real 504

    Cloudflare has a distinct status — 524 A Timeout Occurred — that fires after 100 seconds when the origin accepts the connection but doesn't reply. If your error page says 524, your origin is the slow one and Cloudflare is reporting it. A 504 in Cloudflare means an intermediate hop (often the origin's own gateway) returned 504 and Cloudflare passed it through. The fix path is the same — find the slow upstream — but the source attribution is different.


    Step 2 — Read the gateway's error log

    If your own Nginx (or HAProxy, Traefik, Envoy, etc.) is the gateway, read its log:

    sudo tail -f /var/log/nginx/error.log
    

    Reproduce the request. The error log line for a 504 is unmistakable:

    upstream timed out (110: Connection timed out) while reading response header from upstream
    

    The phrase at the end of that message tells you where the time went; there are three variants:

    • while reading response header from upstream — Nginx connected to the upstream and was waiting for the first byte of the response. The upstream is slow to start responding.
    • while reading upstream — Nginx got the response headers but the upstream stalled mid-body. Often a streaming response, a long query that returns slowly, or an upstream that died after sending headers.
    • while connecting to upstream — Nginx could not finish opening the TCP connection within proxy_connect_timeout. A refused connection normally produces a 502; a connection attempt that simply never completes produces this 504.
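
    The directives behind those messages, shown with Nginx's documented defaults (your config may override them; for FastCGI upstreams the equivalents are fastcgi_connect_timeout, fastcgi_send_timeout, and fastcgi_read_timeout):

    proxy_connect_timeout 60s;   # fires as "while connecting to upstream"
    proxy_send_timeout    60s;   # gap allowed between two writes of the request to the upstream
    proxy_read_timeout    60s;   # fires as "while reading response header from upstream" / "while reading upstream"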

    Add the upstream timing fields to the access log if not already there — they instantly diagnose the difference between "the connection was slow" and "the response was slow":

    log_format upstream '$remote_addr "$request" $status '
                        'urt=$upstream_response_time '
                        'uct=$upstream_connect_time '
                        'uht=$upstream_header_time';
    access_log /var/log/nginx/access.log upstream;
    

    Reading the new fields after a 504:

    • uct near zero, uht close to your timeout → upstream accepted the connection fine, then took forever to send headers. The app is slow.
    • uct close to your timeout → upstream is overloaded and slow to accept connections. The app is saturated, not just slow.
    • urt and uct logged as - (no value recorded) → Nginx never reached the upstream at all. The failure is usually a 502 in disguise, or proxy_next_upstream retried the request against another server. Check the upstream process state.
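
    A concrete 504 under the log_format above might look like this (timings invented for illustration); urt pinned at the timeout value with uct near zero is the classic slow-application signature:

    203.0.113.7 "GET /api/report HTTP/1.1" 504 urt=60.001 uct=0.002 uht=60.001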

    Step 3 — Find the slow path in the upstream

    Once you've confirmed the upstream is the slow one, the question becomes what it is slow on. The order to investigate, in decreasing likelihood:

    3a. Slow database query

    By a wide margin the most common cause. The upstream is alive, the worker is busy, but the worker is blocked waiting on the database.

    For MySQL / MariaDB:

    SHOW FULL PROCESSLIST;
    -- Look for queries with high Time and State = "Sending data" / "Locked" / "Waiting for table metadata lock"
    

    Enable the slow query log temporarily:

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
    

    For PostgreSQL:

    SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY duration DESC
    LIMIT 10;
    

    If you find a query that takes 30 seconds and your proxy_read_timeout is 60 seconds, you're one stroke of bad luck away from a 504 — and you've found your culprit. Almost always the fix is to add an index, rewrite the query, or add caching — not to raise the timeout.
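
    Once the slow log names the query, EXPLAIN usually confirms the diagnosis in seconds. A minimal MySQL sketch with hypothetical table and column names:

    -- type=ALL and millions of examined rows → full table scan, no usable index
    EXPLAIN SELECT * FROM orders WHERE customer_email = 'a@example.com';

    -- after adding the index, the same EXPLAIN shows type=ref and a handful of rows
    CREATE INDEX idx_orders_customer_email ON orders (customer_email);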

    3b. Saturated worker pool

    The upstream is technically running but every worker is busy. New requests queue, the queue takes longer than the gateway timeout, 504.

    For PHP-FPM, look at the status page:

    location = /fpm-status {
        access_log off;
        allow 127.0.0.1;
        deny all;
        fastcgi_pass unix:/run/php/php8.x-fpm.sock;
        include fastcgi_params;
        fastcgi_param SCRIPT_NAME /fpm-status;
    }
    
    curl http://127.0.0.1/fpm-status
    # pool:                 www
    # active processes:     50      ← all busy
    # total processes:      50      ← pm.max_children
    # listen queue:         12      ← requests waiting
    # max listen queue:     128
    

    active processes == total processes and a non-zero listen queue is the saturation fingerprint. The fixes:

    1. Raise pm.max_children if you have RAM headroom. Estimate: (total RAM available to PHP) / (avg PHP-FPM RSS per worker). Aim for ~80% utilization at peak.
    2. Make individual requests faster. A slow worker is a busy worker. The fix in 3a usually frees the pool.
    3. Don't blindly raise pm.max_children without checking RAM — going past available memory causes OOM kills, which look like 502s and 500s.

    For Node (PM2), Python (gunicorn), Ruby (Puma), the equivalent is the worker count + concurrency setting (--workers, threads, cluster mode).
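
    A quick way to measure the average worker RSS for that pm.max_children estimate (the process name varies by distro and PHP version; adjust php-fpm8.2 to match yours):

    ps --no-headers -o rss -C php-fpm8.2 \
      | awk '{ sum += $1; n++ } END { if (n) printf "workers=%d  avg RSS=%.0f MB\n", n, sum/n/1024 }'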

    3c. External dependency timeout

    The upstream is fast at its own work but blocked waiting for a third party — a payment gateway, a SaaS API, a remote storage service. This case has a distinctive log fingerprint: urt is high but the database is idle.

    Three durable fixes:

    1. Tight timeouts on outbound calls. Inside the app, every HTTP client call should have a timeout substantially shorter than the gateway timeout. If the gateway times out at 60 seconds, the app's outbound call should time out at 5–10 seconds, then fall back or surface a clean error.
    2. Move the call out of the request path. Background jobs, async workers, eventual consistency. The user-facing request returns a status; the slow work happens in a queue.
    3. Cache the result aggressively if the third-party data tolerates staleness.
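
    A minimal sketch of rule 1 in Python with requests (the URL, timeouts, and fallback are illustrative; the same idea applies to any HTTP client):

    import requests

    def fetch_charge_status(charge_id: str) -> dict:
        # connect timeout 3 s, read timeout 7 s: far below a 60 s gateway timeout,
        # so a slow third party surfaces as a clean error instead of a 504
        try:
            resp = requests.get(
                f"https://api.payment-provider.example/charges/{charge_id}",
                timeout=(3, 7),
            )
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            return {"status": "unknown", "error": "payment provider timed out"}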

    Raising the gateway timeout to "wait longer for the third party" makes one timeout problem worse instead of better — slow requests pile up and saturate the worker pool, leading to step 3b.

    3d. Deadlock, full GC, or stuck IO

    Less common but worth checking:

    • Database deadlock — SHOW ENGINE INNODB STATUS\G (MySQL) for the latest deadlock report; pg_locks and pg_stat_activity (Postgres).
    • JVM-style full GC — for Java/Kotlin apps, a stop-the-world GC pause can easily exceed an HTTP timeout.
    • Disk IO saturation — iostat -xz 1 on the upstream host. %util near 100% and high await mean the disk is the bottleneck (commonly: log flushing, full disk, noisy neighbour on shared storage).

    Step 4 — When raising the timeout is actually correct

    There is exactly one scenario where raising the gateway timeout is the right answer: the request legitimately takes longer than the current timeout, and that latency is acceptable to users.

    Examples: large-file downloads, CSV exports, server-rendered reports over big datasets, long-poll endpoints.

    In Nginx:

    location /reports/ {
        proxy_pass http://reports_backend;
        proxy_connect_timeout 5s;
        proxy_send_timeout    300s;
        proxy_read_timeout    300s;
    }
    

    Or for FastCGI / PHP-FPM:

    location ~ \.php$ {
        fastcgi_pass unix:/run/php/php8.x-fpm.sock;
        fastcgi_connect_timeout 5s;
        fastcgi_send_timeout    300s;
        fastcgi_read_timeout    300s;
        include fastcgi_params;
    }
    

    Two important rules:

    1. Scope the change to the location that needs it. A site-wide 5-minute timeout means a single hung request can hold a worker for five minutes. Slow endpoints get their own location block; everything else stays at sane defaults (30–60 s).
    2. Match it on the upstream side. PHP-FPM has its own request_terminate_timeout. Nginx's 300 s fastcgi_read_timeout does nothing if PHP-FPM kills its worker after 30 s. Set both, with PHP-FPM slightly higher than Nginx so Nginx sees the response before PHP-FPM gives up.
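
    For rule 2, the matching PHP-FPM setting lives in the pool config, kept slightly above the 300 s Nginx read timeout from the example above (path shown for a typical Debian/Ubuntu layout):

    ; /etc/php/8.x/fpm/pool.d/www.conf
    request_terminate_timeout = 310s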

    Step 5 — Cloudflare, AWS, GCP — when the 504 is from a managed gateway

    If your traffic flows through a managed CDN or load balancer, that layer has its own timeout that you cannot raise indefinitely.

    Layer | Timeout setting | Default / max
    Cloudflare (Free / Pro) | Connection idle to origin | 100 s — returns 524 when exceeded
    Cloudflare Enterprise | Origin response timeout | Configurable up to ~6000 s
    AWS ALB | Idle timeout | 60 s default; configurable 1–4000 s
    AWS API Gateway (REST) | Integration timeout | Hard cap 29 s — cannot be raised
    AWS API Gateway (HTTP) | Integration timeout | Hard cap 30 s
    AWS CloudFront | Origin response timeout | 30 s default; configurable up to 60 s
    Google Cloud Load Balancer (HTTP/S) | Backend response timeout | 30 s default; configurable
    Vercel / Netlify Functions | Execution timeout | 10–60 s depending on plan
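
    For the layers you do control, the knob is usually a single attribute. For example, raising the ALB idle timeout with the AWS CLI (the load balancer ARN is a placeholder):

    aws elbv2 modify-load-balancer-attributes \
      --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/my-alb/abc123 \
      --attributes Key=idle_timeout.timeout_seconds,Value=120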

    If your upstream legitimately takes 60 seconds and you are behind AWS API Gateway, you cannot fix this with a timeout setting — you have hit a platform hard cap. The only fix is to make the work async: return 202 with a job ID, do the work in a background worker, expose a polling or webhook endpoint for the result.
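
    From the client's side, that async pattern looks like this (paths and payloads illustrative):

    # kick off the job; the API answers immediately instead of holding the connection open
    curl -X POST https://api.example.com/reports         # → 202 Accepted  {"job_id":"abc123"}

    # poll (or subscribe to a webhook) until the background worker finishes
    curl https://api.example.com/reports/abc123          # → 200 OK  {"state":"running"}
    curl https://api.example.com/reports/abc123          # → 200 OK  {"state":"done","result_url":"..."}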


    Operational tips

    • Always read $upstream_response_time along with the status code. A 504 with no upstream timing is a different bug than a 504 with a timing of "exactly your timeout setting".
    • A 504 right after a deploy is almost always a cold start or a missed cache. Migration ran, indexes weren't rebuilt, OPcache is empty, Redis cache is cold. Check whether the 504 rate normalises after 5–10 minutes; if it does, you are looking at warm-up cost, not a permanent regression.
    • Don't error_page 504 = 200. Hiding 504s from monitoring by mapping them to 200 is the equivalent of disabling smoke detectors. The on-call gets paged five hours later by an angry customer instead of immediately by the alert.
    • APM is the cheat code. Application performance monitoring (an APM agent inside the upstream) tells you which line of which function spent the time — the fastest path from "we got a 504" to "we know the slow query". Without an APM, you are reconstructing the same picture from logs and SHOW PROCESSLIST snapshots, which is doable but slower.
    • Put a hard timeout on every outbound HTTP call inside the upstream. Default Python requests and Node fetch have no timeout. A blocked outbound call propagates to a gateway 504 with no useful log line. Set timeouts everywhere and surface the timeout in your application log.
    • Watch for nested gateways. Browser → Cloudflare → ALB → Nginx → PHP-FPM is four gateways. Each has a timeout. If you raise Nginx's proxy_read_timeout to 120 s but the ALB's idle timeout is 60 s, the user still sees a 504 — from the ALB.

    Catch 504s before users do

    The painful failure mode for 504s is "a single endpoint is intermittently slow for two days, and the first signal is a churned customer". Generic checks against / or /health won't catch it, because the homepage is usually fast and cached. You need monitoring that exercises the actual slow endpoints with a tight latency budget.

    Sensible setup:

    • For each business-critical path (/api/checkout, /api/search, the user dashboard), set up a check that fails on any 5xx and on response time > a chosen budget (e.g. 3 s). A 504 will trip the status check; a slow-but-still-200 response will trip the latency check before it becomes a 504.
    • Run the check from multiple regions. Latency problems are often regional — a US-west origin can be fast for Americans and timing out for users in Asia long before anyone notices.
    • Alert on the response-time trend, not just on threshold crossings. A check that drifted from 200 ms to 2.5 s over a week is a future 504 you can prevent today.
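
    Even before wiring this into a monitoring product, a manual spot-check of one critical path against its budget is a one-liner (path and the 3-second budget are illustrative):

    curl -s -o /dev/null --max-time 3 \
      -w 'code=%{http_code} total=%{time_total}s ttfb=%{time_starttransfer}s\n' \
      https://example.com/api/checkout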

    Xitoring's website monitoring runs HTTP checks from multiple regions, alerts on status, latency, and keyword/JSON assertions, and surfaces the trend so you see the slow drift before it becomes a timeout. Pair it with server monitoring on the upstream host so a CPU saturation or PHP-FPM listen queue backup is visible on the same dashboard as the 504 spike — the correlation usually answers "is this the network, the host, or the database?" without anyone having to SSH in.

    The Nginx error series continues with how to fix 500 Internal Server Error in Nginx and how to fix 403 Forbidden in Nginx. Different statuses, different causes, same diagnostic discipline: read the right log first, find the gateway second, fix the actual upstream third.


    Summary

    The 504 Gateway Timeout meaning, in one line: the gateway reached the upstream, the upstream did not reply in time, the gateway gave up.

    To debug it:

    1. Identify which gateway returned the 504. curl -sv for the response headers; CDN headers (cf-ray, via, x-amz-) tell you whether your own server, a CDN, or a load balancer is responsible.
    2. Read that gateway's error log. For Nginx, the line upstream timed out ... while reading response header is the smoking gun.
    3. Add upstream timing to the access log. $upstream_connect_time, $upstream_response_time, $upstream_header_time distinguish "slow to connect" (saturated upstream) from "slow to respond" (slow query).
    4. Check the database first. SHOW FULL PROCESSLIST (MySQL) / pg_stat_activity (Postgres). A slow query is the cause 60–70% of the time.
    5. Then check worker pool saturation. PHP-FPM /fpm-status, gunicorn / Puma worker counts. listen queue > 0 means saturation.
    6. Then check external dependencies — outbound API calls without tight timeouts will silently propagate slowness up to a 504.
    7. Raise the gateway timeout only when the work legitimately takes that long — and scope the change to one location, not the whole site. For long-running work, prefer making it async over chasing higher timeouts.
    8. Account for nested gateways. The lowest-set timeout in the chain is the one that fires.

    A 504 is data, not a verdict. Each cause has a fingerprint in the right log; the discipline that pays off is reading those logs in order instead of editing timeouts and hoping. Wire status + latency monitoring into your observability stack so the next slow drift is caught before it becomes a 504, and the next 504 is paged the moment it appears — not five hours later via the support inbox.