INTG-STD-035 | v1.1.0 | MANDATORY | INTEGRATION standard

Timeout

Purpose

Every external call - whether to an API, database, message broker, or file system - MUST have an explicit timeout. Unbounded waits are the single most common cause of cascading failures in distributed systems. This standard defines mandatory timeout categories, default values by integration type, deadline propagation rules, and observability requirements. It complements INTG-STD-033 (Resilience) and INTG-STD-034 (Retry).

Rules

R-1: Explicit Timeout on Every External Call

Every outbound call to an external dependency MUST have an explicit timeout configured. This includes HTTP/REST, gRPC, database queries, message broker operations, file/object storage, DNS lookups, cache reads/writes, and any third-party SDK call performing network I/O. Relying on language or library default timeouts is NOT acceptable - many libraries default to infinite timeouts.

R-2: Separate Connection and Read Timeouts

Services MUST configure connection timeout and read timeout as independent values. Connection timeout governs TCP handshake completion; read timeout governs time waiting for the server response after the connection is established.

Connection timeout SHOULD follow: connection_timeout = round_trip_time * 3. For most intra-region calls, 2 seconds provides ample headroom. Read timeout MUST be based on measured downstream latency percentiles (p99.9 recommended), not guesswork.
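
A minimal sketch of how the two values might be derived and applied independently, using only the Python standard library. The helper names (connection_timeout_s, nearest_rank_percentile, fetch_bytes) and the nearest-rank percentile method are illustrative assumptions, not part of this standard.

```python
import math
import socket


def connection_timeout_s(round_trip_time_s: float) -> float:
    """R-2 guideline: connection_timeout = round_trip_time * 3."""
    return round_trip_time_s * 3


def nearest_rank_percentile(samples_ms: list, pct: float) -> float:
    """Return the pct-th percentile of measured latencies (nearest-rank method)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def fetch_bytes(host: str, port: int, payload: bytes,
                connect_timeout_s: float, read_timeout_s: float) -> bytes:
    """Open a TCP connection and read one response chunk with independent timeouts."""
    # The timeout passed here applies to each TCP connect attempt.
    with socket.create_connection((host, port), timeout=connect_timeout_s) as sock:
        # From here on, each blocking send/recv is bounded by the read timeout.
        sock.settimeout(read_timeout_s)
        sock.sendall(payload)
        return sock.recv(4096)
```

For example, a measured round trip of 50ms gives a 150ms connection timeout, while the read timeout would come from nearest_rank_percentile(latency_samples, 99.9) over real measurements.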

R-3: Default Timeout Values by Integration Type

The following defaults MUST be used unless a documented exception exists with architectural approval:

| Integration Type | Connection | Read | Total | Rationale |
|---|---|---|---|---|
| REST/HTTP API | 2s | 5s | 10s | Most API calls complete within 1-2s at p99 |
| gRPC (unary) | 2s | 5s | 10s | Comparable to REST for request-response |
| gRPC (streaming) | 2s | 30s | 300s | Streams require longer read windows |
| Database query | 2s | 3s | 5s | Queries beyond 3s indicate missing indexes |
| Database transaction | 2s | 3s | 10s | Multi-statement transactions need headroom |
| Message publish | 2s | 5s | 10s | Broker acknowledgment should be fast |
| Message consume | 2s | 30s | 60s | Long-poll patterns require extended waits |
| Cache (Redis/Memcache) | 1s | 1s | 2s | Cache misses should fail fast |
| File/Object storage | 5s | 60s | 120s | Large transfers need proportional budgets |
| SMTP/Email | 5s | 30s | 60s | Mail servers vary widely in response time |
| DNS resolution | 2s | N/A | 2s | DNS should resolve from local cache |
| MCP tool invocation | 2s | 10s | 15s | AI tool calls may involve upstream LLM calls |
| Webhook delivery | 2s | 5s | 10s | Receiver should acknowledge quickly |

Exceptions MUST be documented in the service's integration manifest with the dependency name, overridden value, justification, architecture team approval, and a review date no longer than 6 months out.
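
One way to keep these defaults in a single place is a small lookup table that approved exceptions can override. The sketch below covers only a subset of the rows; the dictionary keys, the TimeoutDefaults type, and the resolve_timeouts helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeoutDefaults:
    connection_s: float
    read_s: Optional[float]  # None where the table says N/A
    total_s: float


# A subset of the R-3 table; the keys are hypothetical identifiers.
R3_DEFAULTS = {
    "rest_http": TimeoutDefaults(2, 5, 10),
    "grpc_unary": TimeoutDefaults(2, 5, 10),
    "database_query": TimeoutDefaults(2, 3, 5),
    "cache": TimeoutDefaults(1, 1, 2),
    "dns": TimeoutDefaults(2, None, 2),
}


def resolve_timeouts(integration_type: str, approved_overrides=None) -> TimeoutDefaults:
    """Return the approved override for this integration type, else the R-3 default."""
    overrides = approved_overrides or {}
    if integration_type in overrides:
        return overrides[integration_type]
    # An unknown type raises KeyError rather than silently running unbounded.
    return R3_DEFAULTS[integration_type]
```

Raising on an unknown integration type is a deliberate choice in this sketch: a missing entry should surface as an error, not fall back to a library default.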

R-4: Deadline Propagation

Services that receive inbound requests with a deadline MUST propagate a reduced deadline to downstream calls:

downstream_deadline = incoming_deadline - elapsed_time - safety_margin

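Here incoming_deadline is the time budget that remained when the request arrived. When the deadline arrives as an absolute timestamp, the equivalent remaining budget is incoming_deadline - now() - safety_margin. A safety margin of 100-500ms is RECOMMENDED. Protocol-specific mechanisms: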

| Protocol | Mechanism | Notes |
|---|---|---|
| gRPC | grpc-timeout header (automatic) | Framework propagates via context |
| HTTP | X-Request-Deadline header (epoch millis) | Application layer must read and propagate |
| Kafka | Message header x-deadline or record timestamp + TTL | Consumer checks before processing |
| GraphQL | extensions.deadline field | Resolver checks remaining budget per field |

For gRPC, services MUST NOT create new contexts that discard the incoming deadline. For HTTP, services SHOULD include and honor the X-Request-Deadline header.
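
For the HTTP case, the read-and-propagate step could look roughly like the sketch below. It assumes the header carries an absolute deadline in epoch milliseconds, as the table states. The 200ms safety margin, the 10s fallback budget, the case-sensitive header lookup, and both function names are assumptions made for this example.

```python
import time
from typing import Mapping, Optional

DEADLINE_HEADER = "X-Request-Deadline"  # epoch milliseconds
SAFETY_MARGIN_MS = 200                  # assumed value inside the recommended 100-500ms range
DEFAULT_TIMEOUT_MS = 10_000             # assumed fallback when no deadline arrives


def now_ms() -> int:
    return int(time.time() * 1000)


def incoming_deadline_ms(headers: Mapping[str, str], now: Optional[int] = None) -> int:
    """Read the absolute deadline from the request, or fall back to a default budget."""
    now = now_ms() if now is None else now
    raw = headers.get(DEADLINE_HEADER)
    if raw is None:
        return now + DEFAULT_TIMEOUT_MS
    return int(raw)


def downstream_call_params(deadline_ms: int, now: Optional[int] = None):
    """Return (timeout_seconds, headers) for a downstream call with a reduced deadline."""
    now = now_ms() if now is None else now
    reduced_deadline_ms = deadline_ms - SAFETY_MARGIN_MS
    remaining_ms = reduced_deadline_ms - now
    if remaining_ms <= 0:
        raise TimeoutError("deadline budget exhausted")
    # The local client timeout and the propagated header describe the same instant.
    return remaining_ms / 1000, {DEADLINE_HEADER: str(reduced_deadline_ms)}
```

The returned timeout would be passed to the HTTP client and the returned headers merged into the outgoing request, so the next hop sees a deadline that is already shortened by the safety margin.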

R-5: Deadline Budget Enforcement

A service MUST NOT initiate a downstream call if the remaining deadline budget is less than the minimum time required to complete it. This rule prevents "wasted work": starting a call that will certainly exceed its deadline consumes downstream resources (threads, connections, compute) without any possibility of the result being used — the upstream caller has already timed out or will before the response arrives. Failing fast with a budget-exhausted error is always preferable to starting a doomed call.

When the remaining budget is insufficient, a service MUST return immediately with a timeout error, log the budget exhaustion event, and increment the timeout.budget_exhausted_total metric.

function call_downstream(downstream, incoming_deadline, safety_margin, downstream_min):
    # Budget left once the safety margin is reserved
    remaining = incoming_deadline - now() - safety_margin
    if remaining < downstream_min:
        # Fail fast: the call cannot finish before the deadline
        log_warn("Deadline budget exhausted",
                 remaining_ms=remaining, required_ms=downstream_min)
        emit_metric("timeout.budget_exhausted_total")
        return TIMEOUT_ERROR

    # Absolute deadline handed to the downstream call
    downstream_deadline = now() + remaining
    return call(downstream, deadline=downstream_deadline)

R-6: Protocol-Specific Timeout Rules

Services MUST apply the following protocol-specific rules:

| Protocol | Rule | Severity |
|---|---|---|
| HTTP/REST | Return 408 when server terminates a slow client; 504 when gateway times out upstream | ERROR |
| HTTP/REST | TLS handshake timeout MUST be included in total timeout budget | ERROR |
| gRPC | Every RPC MUST have a deadline set; calls without deadlines are unbounded | ERROR |
| gRPC | Servers MUST check context cancellation and abort work on expired deadlines | ERROR |
| Kafka | request.timeout.ms on producer and max.poll.interval.ms on consumer MUST be set | ERROR |
| RabbitMQ | Consumer ack timeout MUST be configured; message TTL SHOULD be set | ERROR |
| SQS | VisibilityTimeout MUST be at least 6x expected processing time | ERROR |
| Database | statement_timeout and idle_in_transaction_session_timeout MUST be configured | ERROR |
| File/Object | Per-part upload timeout and stalled transfer detection (30s recommended) MUST be set | ERROR |
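
Two of these rules reduce to simple derived values. The sketch below computes the SQS visibility floor and builds the session statements for the two database settings, which are PostgreSQL parameters that take milliseconds when given a bare integer. Both helper names are hypothetical, and the example timeout values are assumptions.

```python
def sqs_min_visibility_timeout_s(expected_processing_s: float) -> float:
    """R-6: VisibilityTimeout must be at least 6x expected processing time."""
    return 6 * expected_processing_s


def postgres_session_timeouts(statement_timeout_ms: int, idle_in_tx_timeout_ms: int) -> list:
    """Build SET statements for the two PostgreSQL settings named in R-6."""
    if statement_timeout_ms <= 0 or idle_in_tx_timeout_ms <= 0:
        # In PostgreSQL a value of 0 disables the timeout, which this rule forbids.
        raise ValueError("timeouts must be positive")
    return [
        f"SET statement_timeout = {int(statement_timeout_ms)}",
        f"SET idle_in_transaction_session_timeout = {int(idle_in_tx_timeout_ms)}",
    ]
```

A handler expected to take 20s per message would therefore need a visibility timeout of at least 120s.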

R-7: Client-Side and Server-Side Timeouts

Both client and server MUST configure timeouts independently. Client-side timeouts protect against slow servers; server-side timeouts protect against slow or malicious clients.

The client-side timeout MUST be at least the server-side timeout for the same operation. If the client times out first, the server wastes resources processing a request whose result will be discarded.

R-8: Timeout and Circuit Breaker Interaction

Timeouts and circuit breakers (INTG-STD-033) MUST work together:

  1. Each timeout event MUST increment the circuit breaker's failure counter. When the threshold is reached, the circuit MUST open.
  2. When the circuit is open, requests MUST fail immediately without waiting for a timeout.
  3. Half-open probe requests SHOULD use 50% of the normal timeout for faster degradation detection.
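
The three rules above can be sketched as a small breaker that treats TimeoutError as its failure signal. This is an illustrative toy, not the INTG-STD-033 implementation: the class name, the threshold of 5, the 30s open period, and the injectable clock are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class TimeoutAwareBreaker:
    def __init__(self, failure_threshold=5, open_seconds=30.0,
                 normal_timeout_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.normal_timeout_s = normal_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_seconds:
            return "half_open"
        return "open"

    def call(self, fn):
        """Run fn(timeout_s); fn must raise TimeoutError when the timeout elapses."""
        state = self.state()
        if state == "open":
            # Rule 2: fail immediately instead of waiting for another timeout.
            raise CircuitOpenError("circuit open; failing fast")
        # Rule 3: half-open probes use 50% of the normal timeout.
        timeout_s = self.normal_timeout_s * 0.5 if state == "half_open" else self.normal_timeout_s
        try:
            result = fn(timeout_s)
        except TimeoutError:
            # Rule 1: every timeout increments the failure counter.
            self.failures += 1
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the counter.
        self.failures = 0
        self.opened_at = None
        return result
```

A failed half-open probe reopens the circuit immediately in this sketch, regardless of the counter; that is one common choice, not something this standard prescribes.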

R-9: Security - Timeouts as Defense

Timeouts MUST defend against resource exhaustion attacks:

  • Slowloris prevention: Server read-header timeout MUST be 5s or less. Slowloris is a denial-of-service attack where an attacker opens many connections and sends HTTP headers very slowly (one byte at a time), keeping connections open indefinitely and exhausting the server's connection pool. A short read-header timeout forcibly closes stalled connections.
  • Slow POST prevention: Server MUST enforce a minimum data rate; connections transmitting less than 500 bytes/second for more than 10 seconds MUST be terminated. Slow POST is the body-phase equivalent of Slowloris — the attacker slowly dribbles POST body bytes to hold connections open. The 500 bytes/second threshold is intentionally strict: legitimate clients on any reasonable network exceed this rate. APIs receiving very small payloads (under 1 KB) effectively get this protection for free from their read timeout. APIs receiving large file uploads MAY use a higher byte budget but MUST document the exception.
  • Connection pool exhaustion: Idle connections beyond 120s SHOULD be closed to reclaim pool slots.
  • Query of death: Database statement timeouts MUST prevent single queries from monopolizing resources.

Services MUST NOT extend timeouts under load. Longer timeouts during overload consume more resources and accelerate cascading failures. The correct response to overload is to shed load via circuit breakers or rate limiting.
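
The Slow POST rule amounts to tracking how long a connection has stayed under the rate floor. A minimal sketch, assuming the server samples bytes received at regular intervals; the MinimumRateGuard class and its observe method are hypothetical names, and real servers usually expose this as configuration instead of application code.

```python
class MinimumRateGuard:
    """Flags a connection whose body rate stays below a floor for too long."""

    def __init__(self, min_bytes_per_s: float = 500, grace_s: float = 10.0):
        self.min_bytes_per_s = min_bytes_per_s
        self.grace_s = grace_s
        self.slow_since = None  # start of the current below-threshold streak

    def observe(self, bytes_received: int, interval_s: float, now_s: float) -> bool:
        """Record one sampling interval ending at now_s.

        Returns True when the connection should be terminated.
        """
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        rate = bytes_received / interval_s
        if rate >= self.min_bytes_per_s:
            # A healthy interval ends the slow streak.
            self.slow_since = None
            return False
        if self.slow_since is None:
            self.slow_since = now_s - interval_s
        # Terminate only once the slow streak exceeds the grace period.
        return (now_s - self.slow_since) > self.grace_s
```

With one-second samples of 100 bytes each, the guard stays quiet through the tenth second and signals termination on the eleventh.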

R-10: Timeout Metrics

Services MUST emit the following metrics for every timed external call:

| Metric | Type | Labels |
|---|---|---|
| external_call.duration_ms | Histogram | dependency, operation, result |
| external_call.timeout_total | Counter | dependency, operation, timeout_type |
| external_call.deadline_remaining_ms | Histogram | dependency, operation |
| timeout.budget_exhausted_total | Counter | dependency, operation |

Where timeout_type is one of: connection, read, write, total, deadline_exceeded. result is one of: success, timeout, error.
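
A wrapper around each external call is one way to emit the duration and timeout metrics consistently. In the sketch below, plain dictionaries stand in for a real metrics client; those dictionaries and the timed_call helper are assumptions for illustration only.

```python
import time
from collections import defaultdict

# In-memory stand-ins for a real metrics client (hypothetical).
durations_ms = defaultdict(list)   # key: (dependency, operation, result)
timeout_total = defaultdict(int)   # key: (dependency, operation, timeout_type)


def timed_call(dependency, operation, fn, timeout_type="total"):
    """Run fn() and record external_call.duration_ms and external_call.timeout_total."""
    start = time.monotonic()
    result = "success"
    try:
        return fn()
    except TimeoutError:
        result = "timeout"
        timeout_total[(dependency, operation, timeout_type)] += 1
        raise
    except Exception:
        result = "error"
        raise
    finally:
        # Recorded on every path so the histogram covers successes and failures.
        elapsed_ms = (time.monotonic() - start) * 1000
        durations_ms[(dependency, operation, result)].append(elapsed_ms)
```

The caller supplies timeout_type because only the call site knows which timeout (connection, read, write, total, or deadline_exceeded) actually fired.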

R-11: Timeout Logging

Every timeout event MUST be logged with at minimum: dependency, operation, timeout_type, configured_timeout_ms, and elapsed_ms. Additional RECOMMENDED fields: trace_id, span_id, deadline_remaining_ms, retry_attempt, circuit_breaker_state.
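
A small helper can enforce the minimum field set at the point of logging. This sketch uses the standard logging and json modules; the logger name, the JSON line format, and the log_timeout helper are assumptions, not a mandated log schema.

```python
import json
import logging

logger = logging.getLogger("timeouts")

REQUIRED_FIELDS = ("dependency", "operation", "timeout_type",
                   "configured_timeout_ms", "elapsed_ms")


def log_timeout(**fields) -> str:
    """Emit one JSON log line for a timeout event; reject events missing R-11 fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"timeout log event missing fields: {missing}")
    line = json.dumps({"event": "timeout", **fields}, sort_keys=True)
    logger.warning(line)
    return line
```

Recommended fields such as trace_id or circuit_breaker_state pass through as extra keyword arguments, for example log_timeout(dependency="billing", operation="get_invoice", timeout_type="read", configured_timeout_ms=5000, elapsed_ms=5003, trace_id="abc").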

R-12: Timeout Alerting

Services MUST configure alerts for:

| Condition | Severity | Action |
|---|---|---|
| Timeout rate above 5% for a dependency | WARNING | Investigate dependency health |
| Timeout rate above 20% for a dependency | CRITICAL | Trigger incident; circuit breaker should open |
| Budget exhaustion rate above 1% | WARNING | Review timeout budgets and call chain |
| p99 latency above 80% of configured timeout | WARNING | Timeout too tight or dependency is degrading |
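
Expressed as code, the four thresholds might be evaluated as follows. Rates are fractions between 0 and 1. The function name is hypothetical, and reporting only CRITICAL (not both severities) once the timeout rate passes 20% is an assumption of this sketch.

```python
def evaluate_timeout_alerts(timeout_rate: float, budget_exhaustion_rate: float,
                            p99_latency_ms: float, configured_timeout_ms: float) -> list:
    """Return (severity, condition) pairs for the R-12 thresholds."""
    alerts = []
    if timeout_rate > 0.20:
        alerts.append(("CRITICAL", "timeout rate above 20%"))
    elif timeout_rate > 0.05:
        alerts.append(("WARNING", "timeout rate above 5%"))
    if budget_exhaustion_rate > 0.01:
        alerts.append(("WARNING", "budget exhaustion rate above 1%"))
    if p99_latency_ms > 0.80 * configured_timeout_ms:
        alerts.append(("WARNING", "p99 latency above 80% of configured timeout"))
    return alerts
```

In practice these conditions would normally live in the alerting system's own rule language; the function only makes the comparisons explicit.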

Enforcement Rules

The following MUST be enforced via static analysis, configuration scanning, or integration tests:

| Rule | Check | Severity |
|---|---|---|
| TMO-001 | Every HTTP client has explicit connection timeout | ERROR |
| TMO-002 | Every HTTP client has explicit read timeout | ERROR |
| TMO-003 | Connection timeout is at most 5s | ERROR |
| TMO-004 | Read timeout is at most 30s (exceptions require approval) | WARNING |
| TMO-005 | Total timeout is at most 120s (exceptions require approval) | WARNING |
| TMO-006 | Database statement_timeout is configured | ERROR |
| TMO-007 | Kafka max.poll.interval.ms is at most 300s | WARNING |
| TMO-008 | No infinite or zero timeout values in config | ERROR |
| TMO-009 | Server read-header timeout is at most 5s | ERROR |
| TMO-010 | gRPC calls have deadline set | ERROR |
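
A configuration scanner for the HTTP client rules could be as simple as the sketch below. It covers TMO-001 through TMO-005 and TMO-008 only. The configuration keys (connection_timeout_s, read_timeout_s, total_timeout_s) and the function name are hypothetical, and treating negative values like zero is an assumption, since some libraries use a negative number to mean "no timeout".

```python
import math


def check_http_client_config(cfg: dict) -> list:
    """Return (rule_id, severity) findings for one HTTP client config (values in seconds)."""
    findings = []
    conn = cfg.get("connection_timeout_s")
    read = cfg.get("read_timeout_s")
    total = cfg.get("total_timeout_s")

    if conn is None:
        findings.append(("TMO-001", "ERROR"))
    if read is None:
        findings.append(("TMO-002", "ERROR"))
    if conn is not None and conn > 5:
        findings.append(("TMO-003", "ERROR"))
    if read is not None and read > 30:
        findings.append(("TMO-004", "WARNING"))
    if total is not None and total > 120:
        findings.append(("TMO-005", "WARNING"))
    # TMO-008: zero, negative, or infinite values are effectively unbounded.
    if any(v is not None and (v <= 0 or math.isinf(v)) for v in (conn, read, total)):
        findings.append(("TMO-008", "ERROR"))
    return findings
```

A compliant REST client config such as {"connection_timeout_s": 2, "read_timeout_s": 5, "total_timeout_s": 10} produces no findings.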

Additional enforcement:

  • Gateway: API gateways MUST enforce a maximum total timeout; requests exceeding it receive 504.
  • Code review: PRs introducing new external calls MUST include timeout configuration.
  • Runtime: Services SHOULD log a warning when any call exceeds 80% of its configured timeout.

Examples

Deadline Propagation

The following pseudocode illustrates how a service receives an upstream deadline and propagates a reduced deadline to each downstream call:

function handle_request(request):
    deadline = parse_deadline_header(request)
    if deadline is null:
        deadline = now() + DEFAULT_TIMEOUT

    # First downstream call; the helper checks the budget itself
    result_a = fetch_from_service_a(request, deadline)

    # Recalculate remaining budget before next call
    remaining = deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        # 504, not 408: R-6 reserves 408 for slow clients
        return error(504, "Deadline budget exhausted")

    result_b = fetch_from_service_b(request, deadline)
    return combine(result_a, result_b)

function fetch_from_service_a(request, upstream_deadline):
    remaining = upstream_deadline - now() - SAFETY_MARGIN
    if remaining < MIN_DOWNSTREAM_TIMEOUT:
        raise TimeoutError("Budget exhausted before calling Service A")

    return http_call(
        url = SERVICE_A_URL,
        timeout = remaining,
        # Propagate the reduced deadline (R-4), not the original one
        headers = {"X-Request-Deadline": upstream_deadline - SAFETY_MARGIN}
    )

Rationale

Why separate connection and read timeouts? They measure different failure modes. Connection timeout detects network unreachability (host down); read timeout measures server processing speed. Conflating them either slows failure detection or sets unrealistic response expectations.

Why mandate deadline propagation? Without it, a 5-service chain with 10s timeouts per hop can block the originator for 50s - long after the client has disconnected - while downstream services continue wasted work.

Why aggressive defaults? Short timeouts force architectural discipline. A 3s database timeout surfaces missing indexes during development, not production incidents. Services needing longer timeouts can request documented exceptions.

Why not just circuit breakers? Timeouts bound a single call's duration; circuit breakers prevent repeated calls to failing dependencies. Without timeouts, circuit breakers have no signal for slow failures. Both are required; neither is sufficient alone.

Why never extend timeouts under load? Longer timeouts during overload mean more in-flight requests, more consumed resources, and faster cascading failure. The correct response is load shedding, not longer waits.

Version History

| Version | Date | Change |
|---|---|---|
| 1.0.0 | 2026-03-28 | Initial definition |
| 1.1.0 | 2026-04-10 | R-5: added justification for deadline budget enforcement; R-9: added explanation of Slowloris attack, Slow POST attack, and justification for the 500 bytes/s threshold |