INTG-STD-035 | v1.1.0 | MANDATORY | INTEGRATION | standard

Timeout

Purpose

Every external call - whether to an API, database, message broker, or file system - MUST have an explicit timeout. Unbounded waits are the single most common cause of cascading failures in distributed systems. This standard defines mandatory timeout categories, default values by integration type, deadline propagation rules, and observability requirements. It complements INTG-STD-033 (Resilience) and INTG-STD-034 (Retry).

Rules

R-1: Explicit Timeout on Every External Call

Every outbound call to an external dependency MUST have an explicit timeout configured. This includes HTTP/REST, gRPC, database queries, message broker operations, file/object storage, DNS lookups, cache reads/writes, and any third-party SDK call performing network I/O. Relying on language or library default timeouts is NOT acceptable - many libraries default to infinite timeouts.

R-2: Separate Connection and Read Timeouts

Services MUST configure connection timeout and read timeout as independent values. Connection timeout governs TCP handshake completion; read timeout governs time waiting for the server response after the connection is established.
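As an illustration, the following Python sketch configures the two timeouts independently using only the standard library's http.client. The function name fetch and the 2 s / 5 s defaults are assumptions chosen for the example, not part of this standard's normative text.

```python
import http.client
import math


def fetch(host: str, port: int, path: str,
          connect_timeout: float = 2.0, read_timeout: float = 5.0):
    """GET http://host:port/path with independent connect and read timeouts."""
    for name, value in (("connect_timeout", connect_timeout),
                        ("read_timeout", read_timeout)):
        # Reject zero, negative, NaN, and infinite values up front.
        if not (math.isfinite(value) and value > 0):
            raise ValueError(f"{name} must be a positive, finite number of seconds")

    # The constructor timeout bounds the TCP handshake performed by connect().
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        conn.connect()
        # Once connected, switch the socket to the read timeout, which then
        # bounds each blocking send/recv while waiting for the response.
        conn.sock.settimeout(read_timeout)
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Note that a socket-level read timeout applies to each blocking operation, not to the whole exchange, so a separate total timeout is still needed to bound the overall call.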

Connection timeout SHOULD follow: connection_timeout = round_trip_time * 3. For most intra-region calls, 2 seconds provides ample headroom. Read timeout MUST be based on measured downstream latency percentiles (p99.9 recommended), not guesswork.
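The two derivations above can be sketched as follows. This is a minimal example, assuming latency samples are collected in seconds and using a nearest-rank percentile; the function names are hypothetical, and a production service would more likely read percentiles from its metrics backend.

```python
import math


def connection_timeout_s(round_trip_time_s: float, multiplier: float = 3.0) -> float:
    """connection_timeout = round_trip_time * 3, per the formula above."""
    return round_trip_time_s * multiplier


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% * n) in sorted order."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def read_timeout_s(latency_samples_s: list[float], pct: float = 99.9) -> float:
    """Base the read timeout on a measured latency percentile (p99.9 by default)."""
    return percentile(latency_samples_s, pct)
```

With few samples, p99.9 collapses to the maximum observed latency, so the estimate is only meaningful once a large sample has been collected.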

R-3: Default Timeout Values by Integration Type

The following defaults MUST be used unless a documented exception exists with architectural approval:

Integration Type	Connection	Read	Total	Rationale
REST/HTTP API	2s	5s	10s	Most API calls complete within 1-2s at p99
gRPC (unary)	2s	5s	10s	Comparable to REST for request-response
gRPC (streaming)	2s	30s	300s	Streams require longer read windows
Database query	2s	3s	5s	Queries beyond 3s indicate missing indexes
Database transaction	2s	3s	10s	Multi-statement transactions need headroom
Message publish	2s	5s	10s	Broker acknowledgment should be fast
Message consume	2s	30s	60s	Long-poll patterns require extended waits
Cache (Redis/Memcached)	1s	1s	2s	Cache lookups should fail fast
File/Object storage	5s	60s	120s	Large transfers need proportional budgets
SMTP/Email	5s	30s	60s	Mail servers vary widely in response time
DNS resolution	2s	N/A	2s	DNS should resolve from local cache
MCP tool invocation	2s	10s	15s	AI tool calls may involve upstream LLM calls
Webhook delivery	2s	5s	10s	Receiver should acknowledge quickly

Exceptions MUST be documented in the service's integration manifest with the dependency name, overridden value, justification, architecture team approval, and a review date no longer than 6 months out.
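A manifest entry for such an exception could be validated along these lines. This is a sketch only: the TimeoutOverride type, its field names, and the reading of "6 months" as 183 days are assumptions, since this standard does not define a manifest schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_REVIEW_WINDOW = timedelta(days=183)  # assumption: "6 months" read as 183 days


@dataclass
class TimeoutOverride:
    dependency: str
    overridden_value: str   # e.g. "read=15s"
    justification: str
    approved_by: str        # architecture team approval reference
    review_date: date


def validate_override(entry: TimeoutOverride, today: date) -> list[str]:
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = []
    for field in ("dependency", "overridden_value", "justification", "approved_by"):
        if not getattr(entry, field).strip():
            problems.append(f"{field} is required")
    if entry.review_date - today > MAX_REVIEW_WINDOW:
        problems.append("review_date is more than 6 months out")
    return problems
```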

R-4: Deadline Propagation

Services that receive inbound requests with a deadline MUST propagate a reduced deadline to downstream calls:

downstream_deadline = incoming_deadline - elapsed_time - safety_margin

A safety margin of 100-500ms is RECOMMENDED. Protocol-specific mechanisms:

Protocol	Mechanism	Notes
gRPC	`grpc-timeout` header (automatic)	Framework propagates via context
HTTP	`X-Request-Deadline` header (epoch millis)	Application layer must read and propagate
Kafka	Message header `x-deadline` or record timestamp + TTL	Consumer checks before processing
GraphQL	`extensions.deadline` field	Resolver checks remaining budget per field

For gRPC, services MUST NOT create new contexts that discard the incoming deadline. For HTTP, services SHOULD include and honor the X-Request-Deadline header.
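For the HTTP case, reading and propagating the header might look like the sketch below. It assumes the header carries an absolute deadline in epoch milliseconds (as in the table above) and that header names arrive in exactly this casing; the function name downstream_budget and the 200 ms default margin are illustrative.

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"


def downstream_budget(incoming_headers: dict, safety_margin_ms: int = 200, now_ms=None):
    """Return (timeout_seconds, outbound_headers), or None if no deadline was sent.

    The timeout is the time left until the incoming deadline, minus the safety
    margin. The outbound header carries the reduced absolute deadline.
    """
    raw = incoming_headers.get(DEADLINE_HEADER)
    if raw is None:
        return None  # caller falls back to its configured default timeout
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    incoming_deadline_ms = int(raw)
    remaining_ms = incoming_deadline_ms - now_ms - safety_margin_ms
    outbound = {DEADLINE_HEADER: str(incoming_deadline_ms - safety_margin_ms)}
    return remaining_ms / 1000.0, outbound
```

The returned timeout can be zero or negative when the deadline has already passed; the caller is expected to apply the budget check from R-5 before making the call.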

R-5: Deadline Budget Enforcement

A service MUST NOT initiate a downstream call if the remaining deadline budget is less than the minimum time required to complete it. This rule prevents "wasted work": starting a call that will certainly exceed its deadline consumes downstream resources (threads, connections, compute) without any possibility of the result being used — the upstream caller has already timed out or will before the response arrives. Failing fast with a budget-exhausted error is always preferable to starting a doomed call.

When the remaining budget is insufficient, a service MUST return immediately with a timeout error, log the budget exhaustion event, and increment the timeout.budget_exhausted_total metric.

function call_downstream(incoming_deadline, safety_margin, downstream_min):
    remaining = incoming_deadline - now() - safety_margin
    if remaining < downstream_min:
        log_warn("Deadline budget exhausted",
            remaining_ms=remaining, required_ms=downstream_min)
        emit_metric("timeout.budget_exhausted_total")
        return TIMEOUT_ERROR

    downstream_deadline = now() + remaining
    return call(downstream, deadline=downstream_deadline)
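A runnable Python rendering of the pseudocode above is sketched here. The in-memory Counter stands in for a real metrics client, BudgetExhausted is a hypothetical exception type, and the deadline is assumed to be an absolute time.monotonic() value in seconds.

```python
import logging
import time
from collections import Counter

logger = logging.getLogger("timeouts")
metrics = Counter()  # hypothetical stand-in for a real metrics client


class BudgetExhausted(TimeoutError):
    """Raised instead of starting a call that cannot finish in time."""


def call_downstream(call, deadline, safety_margin, downstream_min, now=time.monotonic):
    """Invoke call(timeout=...) only if enough deadline budget remains.

    deadline is an absolute clock value; the other arguments are in seconds.
    """
    remaining = deadline - now() - safety_margin
    if remaining < downstream_min:
        logger.warning("Deadline budget exhausted: remaining_ms=%d required_ms=%d",
                       remaining * 1000, downstream_min * 1000)
        metrics["timeout.budget_exhausted_total"] += 1
        raise BudgetExhausted("deadline budget exhausted")
    # Pass the remaining budget as the timeout of the downstream call.
    return call(timeout=remaining)
```

Raising an exception rather than returning an error value is a design choice of this sketch; it keeps the fail-fast path distinct from a normal response.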

R-6: Protocol-Specific Timeout Rules

Services MUST apply the following protocol-specific rules:

Protocol	Rule	Severity
HTTP/REST	Return `408` when server terminates a slow client; `504` when gateway times out upstream	ERROR
HTTP/REST	TLS handshake timeout MUST be included in total timeout budget	ERROR
gRPC	Every RPC MUST have a deadline set; calls without deadlines are unbounded	ERROR
gRPC	Servers MUST check context cancellation and abort work on expired deadlines	ERROR
Kafka	`request.timeout.ms` on producer and `max.poll.interval.ms` on consumer MUST be set	ERROR
RabbitMQ	Consumer ack timeout MUST be configured; message TTL SHOULD be set	ERROR
SQS	`VisibilityTimeout` MUST be at least 6x expected processing time	ERROR
Database	`statement_timeout` and `idle_in_transaction_session_timeout` MUST be configured	ERROR
File/Object	Per-part upload timeout and stalled transfer detection (30s recommended) MUST be set	ERROR

R-7: Client-Side and Server-Side Timeouts

Both client and server MUST configure timeouts independently. Client-side timeouts protect against slow servers; server-side timeouts protect against slow or malicious clients.

The client-side timeout MUST be at least the server-side timeout for the same operation. If the client times out first, the server keeps processing a request whose result will be discarded, which wastes server resources.

R-8: Timeout and Circuit Breaker Interaction

Timeouts and circuit breakers (INTG-STD-033) MUST work together:

Each timeout event MUST increment the circuit breaker's failure counter. When the threshold is reached, the circuit MUST open.
When the circuit is open, requests MUST fail immediately without waiting for a timeout.
Half-open probe requests SHOULD use 50% of the normal timeout for faster degradation detection.
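The three interactions above can be sketched with a small breaker. This is an illustration under assumptions, not the INTG-STD-033 implementation: the class name, the consecutive-failure counting, the 30 s cooldown before half-open, and the use of Python's built-in TimeoutError as the timeout signal are all choices made for the example. The sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected immediately."""


class TimeoutAwareBreaker:
    def __init__(self, normal_timeout_s, failure_threshold=5, cooldown_s=30.0,
                 clock=time.monotonic):
        self.normal_timeout_s = normal_timeout_s
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def call(self, fn):
        """Run fn(timeout_seconds) under the breaker."""
        state = self.state()
        if state == "open":
            # Fail immediately instead of waiting for another timeout.
            raise CircuitOpenError("circuit open; failing fast")
        # Half-open probes get 50% of the normal timeout.
        timeout = self.normal_timeout_s * (0.5 if state == "half_open" else 1.0)
        try:
            result = fn(timeout)
        except TimeoutError:
            self.failures += 1  # every timeout counts as a failure
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```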

R-9: Security - Timeouts as Defense

Timeouts MUST defend against resource exhaustion attacks:

Slowloris prevention: Server read-header timeout MUST be 5s or less. Slowloris is a denial-of-service attack where an attacker opens many connections and sends HTTP headers very slowly (one byte at a time), keeping connections open indefinitely and exhausting the server's connection pool. A short read-header timeout forcibly closes stalled connections.
Slow POST prevention: Server MUST enforce a minimum data rate; connections transmitting less than 500 bytes/second for more than 10 seconds MUST be terminated. Slow POST is the body-phase equivalent of Slowloris — the attacker slowly dribbles POST body bytes to hold connections open. The 500 bytes/second threshold is intentionally strict: legitimate clients on any reasonable network exceed this rate. APIs receiving very small payloads (under 1 KB) effectively get this protection for free from their read timeout. APIs receiving large file uploads MAY use a higher byte budget but MUST document the exception.
Connection pool exhaustion: Idle connections beyond 120s SHOULD be closed to reclaim pool slots.
Query of death: Database statement timeouts MUST prevent single queries from monopolizing resources.

Services MUST NOT extend timeouts under load. Longer timeouts during overload consume more resources and accelerate cascading failures. The correct response to overload is to shed load via circuit breakers or rate limiting.
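One way to approximate the minimum data rate check for Slow POST prevention is a sliding window over received bytes, sketched below. The interpretation is an assumption: "less than 500 bytes/second for more than 10 seconds" is read here as an average rate below 500 bytes/second over the most recent 10 seconds. The class name and the caller-supplied timestamps are hypothetical.

```python
from collections import deque


class SlowTransferDetector:
    """Flags a connection whose average rate over the last window is too low."""

    def __init__(self, started_at, min_bytes_per_s=500, window_s=10.0):
        self.started_at = started_at
        self.min_bytes_per_s = min_bytes_per_s
        self.window_s = window_s
        self.chunks = deque()  # (timestamp, byte_count) pairs inside the window
        self.bytes_in_window = 0

    def record(self, byte_count, now):
        """Record byte_count bytes received at time now (seconds)."""
        self.chunks.append((now, byte_count))
        self.bytes_in_window += byte_count

    def should_terminate(self, now):
        # Drop chunks that have aged out of the sliding window.
        while self.chunks and self.chunks[0][0] <= now - self.window_s:
            _, old = self.chunks.popleft()
            self.bytes_in_window -= old
        if now - self.started_at <= self.window_s:
            return False  # not enough history yet
        return self.bytes_in_window < self.min_bytes_per_s * self.window_s
```

A server loop would call record() on each received chunk and check should_terminate() periodically, closing the connection when it returns True.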

R-10: Timeout Metrics

Services MUST emit the following metrics for every timed external call:

Metric	Type	Labels
`external_call.duration_ms`	Histogram	`dependency`, `operation`, `result`
`external_call.timeout_total`	Counter	`dependency`, `operation`, `timeout_type`
`external_call.deadline_remaining_ms`	Histogram	`dependency`, `operation`
`timeout.budget_exhausted_total`	Counter	`dependency`, `operation`

Here, timeout_type is one of connection, read, write, total, or deadline_exceeded, and result is one of success, timeout, or error.
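A wrapper that records the duration histogram and the timeout counter could look like the following sketch. The in-memory dictionaries are hypothetical stand-ins for a real metrics client, and the wrapper assumes timeouts surface as Python's built-in TimeoutError.

```python
import time
from collections import defaultdict

# Hypothetical in-memory sinks standing in for a real metrics client.
histograms = defaultdict(list)  # (metric, *labels) -> observed values
counters = defaultdict(int)     # (metric, *labels) -> count


def timed_call(dependency, operation, fn, timeout_type="total"):
    """Run fn() and record duration and timeout metrics with the labels above."""
    start = time.monotonic()
    result = "success"
    try:
        return fn()
    except TimeoutError:
        result = "timeout"
        counters[("external_call.timeout_total",
                  dependency, operation, timeout_type)] += 1
        raise
    except Exception:
        result = "error"
        raise
    finally:
        # Runs on every path, so the duration is recorded exactly once per call.
        elapsed_ms = (time.monotonic() - start) * 1000.0
        histograms[("external_call.duration_ms",
                    dependency, operation, result)].append(elapsed_ms)
```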

R-11: Timeout Logging

Every timeout event MUST be logged with at minimum: dependency, operation, timeout_type, configured_timeout_ms, and elapsed_ms. Additional RECOMMENDED fields: trace_id, span_id, deadline_remaining_ms, retry_attempt, circuit_breaker_state.
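A structured log helper that refuses events missing the mandatory fields is sketched below. JSON output, the logger name, and the extra "event" key are assumptions for the example; this standard does not prescribe a log format.

```python
import json
import logging

REQUIRED_FIELDS = ("dependency", "operation", "timeout_type",
                   "configured_timeout_ms", "elapsed_ms")

logger = logging.getLogger("timeouts")


def format_timeout_event(**fields) -> str:
    """Serialize a timeout event, rejecting it if a mandatory field is absent."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"timeout log event missing fields: {missing}")
    return json.dumps({"event": "timeout", **fields}, sort_keys=True)


def log_timeout(**fields) -> None:
    """Log the event at WARNING level; recommended fields pass through as-is."""
    logger.warning(format_timeout_event(**fields))
```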

R-12: Timeout Alerting

Services MUST configure alerts for:

Condition	Severity	Action
Timeout rate above 5% for a dependency	WARNING	Investigate dependency health
Timeout rate above 20% for a dependency	CRITICAL	Trigger incident; circuit breaker should open
Budget exhaustion rate above 1%	WARNING	Review timeout budgets and call chain
p99 latency above 80% of configured timeout	WARNING	Timeout too tight or dependency is degrading
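The thresholds in the table can be expressed as a small evaluation function. This sketch assumes counts aggregated per dependency over some evaluation window, uses the number of attempted calls as the denominator for both rates, and reports only CRITICAL when the timeout rate exceeds both thresholds; these are choices made for the example.

```python
def evaluate_alerts(calls, timeouts, budget_exhausted,
                    p99_latency_ms, configured_timeout_ms):
    """Return a list of (severity, condition) tuples for one dependency."""
    alerts = []
    if calls > 0:
        timeout_rate = timeouts / calls
        if timeout_rate > 0.20:
            alerts.append(("CRITICAL", "timeout rate above 20%"))
        elif timeout_rate > 0.05:
            alerts.append(("WARNING", "timeout rate above 5%"))
        if budget_exhausted / calls > 0.01:
            alerts.append(("WARNING", "budget exhaustion rate above 1%"))
    if p99_latency_ms > 0.80 * configured_timeout_ms:
        alerts.append(("WARNING", "p99 latency above 80% of configured timeout"))
    return alerts
```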

Enforcement Rules

The following MUST be enforced via static analysis, configuration scanning, or integration tests:

Rule	Check	Severity
TMO-001	Every HTTP client has explicit connection timeout	ERROR
TMO-002	Every HTTP client has explicit read timeout	ERROR
TMO-003	Connection timeout is at most 5s	ERROR
TMO-004	Read timeout is at most 30s (exceptions require approval)	WARNING
TMO-005	Total timeout is at most 120s (exceptions require approval)	WARNING
TMO-006	Database `statement_timeout` is configured	ERROR
TMO-007	Kafka `max.poll.interval.ms` is at most 300s	WARNING
TMO-008	No infinite or zero timeout values in config	ERROR
TMO-009	Server read-header timeout is at most 5s	ERROR
TMO-010	gRPC calls have deadline set	ERROR

Additional enforcement:

Gateway: API gateways MUST enforce a maximum total timeout; requests exceeding it receive 504.
Code review: PRs introducing new external calls MUST include timeout configuration.
Runtime: Services SHOULD log a warning when any call exceeds 80% of its configured timeout.
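A configuration scanner for a subset of the TMO checks might be sketched as follows. The configuration key names (connection_timeout_s, read_timeout_s, total_timeout_s) are hypothetical, and treating negative values as TMO-008 violations is an assumption beyond the rule's "infinite or zero" wording.

```python
import math


def check_http_client(config: dict) -> list[tuple[str, str]]:
    """Return (rule_id, severity) findings for one HTTP client configuration."""
    findings = []
    connect = config.get("connection_timeout_s")
    read = config.get("read_timeout_s")
    total = config.get("total_timeout_s")

    if connect is None:
        findings.append(("TMO-001", "ERROR"))
    if read is None:
        findings.append(("TMO-002", "ERROR"))
    if connect is not None and connect > 5:
        findings.append(("TMO-003", "ERROR"))
    if read is not None and read > 30:
        findings.append(("TMO-004", "WARNING"))
    if total is not None and total > 120:
        findings.append(("TMO-005", "WARNING"))
    # Zero, negative, or infinite values effectively disable the timeout.
    for value in (connect, read, total):
        if value is not None and (value <= 0 or math.isinf(value)):
            findings.append(("TMO-008", "ERROR"))
            break
    return findings
```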

Examples

Deadline Propagation

The following pseudocode illustrates how a service receives an upstream deadline and propagates a reduced deadline to each downstream call:

function handle_request(request):
    deadline = parse_deadline_header(request)
    if deadline is null:
        deadline = now() + DEFAULT_TIMEOUT

    # First downstream call; the callee performs its own budget check
    result_a = fetch_from_service_a(request, deadline)

    # Recalculate remaining budget before next call
    remaining = deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        # 504, not 408: per R-6, 408 is reserved for slow clients
        return error(504, "Deadline budget exhausted")

    result_b = fetch_from_service_b(request, deadline)
    return combine(result_a, result_b)

function fetch_from_service_a(request, upstream_deadline):
    remaining = upstream_deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        raise TimeoutError("Budget exhausted before calling Service A")

    return http_call(
        url = SERVICE_A_URL,
        timeout = remaining,
        # Propagate the reduced deadline (R-4), not the original one
        headers = {"X-Request-Deadline": upstream_deadline - SAFETY_MARGIN}
    )

Rationale

Why separate connection and read timeouts? They measure different failure modes. Connection timeout detects network unreachability (host down); read timeout measures server processing speed. Conflating them either slows failure detection or sets unrealistic response expectations.

Why mandate deadline propagation? Without it, a 5-service chain with 10s timeouts per hop can block the originator for 50s - long after the client has disconnected - while downstream services continue wasted work.

Why aggressive defaults? Short timeouts force architectural discipline. A 3s database timeout surfaces missing indexes during development, not production incidents. Services needing longer timeouts can request documented exceptions.

Why not just circuit breakers? Timeouts bound a single call's duration; circuit breakers prevent repeated calls to failing dependencies. Without timeouts, circuit breakers have no signal for slow failures. Both are required; neither is sufficient alone.

Why never extend timeouts under load? Longer timeouts during overload mean more in-flight requests, more consumed resources, and faster cascading failure. The correct response is load shedding, not longer waits.

References

RFC 9110 - HTTP Semantics (408 Request Timeout, 504 Gateway Timeout)
Zalando Engineering - Timeouts - Connection timeout formula, latency percentile baselines
AWS Builders' Library - Timeouts, Retries, and Backoff - False-timeout rate, retry multiplication risks
gRPC Deadlines - Automatic deadline propagation, DEADLINE_EXCEEDED
INTG-STD-033 - Resilience Standard (circuit breakers, bulkheads, fallbacks)
INTG-STD-034 - Retry Standard (retry policies, backoff, idempotency)

Version History

Version	Date	Change
1.0.0	2026-03-28	Initial definition
1.1.0	2026-04-10	R-5: added justification for deadline budget enforcement; R-9: added explanation of Slowloris attack, Slow POST attack, and justification for the 500 bytes/s threshold
