INTG-STD-034 | v1.1.0 | MANDATORY | INTEGRATION | standard

Retry Policy

Purpose

Uncontrolled retries are a leading cause of cascading failures. When clients retry simultaneously with correlated timing, the resulting "thundering herd" overwhelms recovering services. This standard mandates exponential backoff with jitter, retry budgets, and idempotency requirements across all integration boundaries.

Companion standards: INTG-STD-033 (Resilience), INTG-STD-035 (Timeout).

Normative language follows RFC 2119 semantics.

Rules

R-1: Exponential Backoff with Full Jitter

All retry implementations MUST use exponential backoff with full jitter as the default algorithm:

sleep = random_between(0, min(cap, base * 2 ^ attempt))
  • base MUST default to 1 second
  • cap MUST default to 30 seconds
  • Decorrelated jitter MAY be used as an alternative
  • Equal jitter and fixed-interval retry MUST NOT be used
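
The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `full_jitter_delay` and the optional `rng` parameter (added so the randomness can be seeded in tests) are assumptions, and `attempt` is taken to be zero-based.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Return a sleep time in seconds for a zero-based retry attempt.

    Implements sleep = random_between(0, min(cap, base * 2 ^ attempt)).
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
    return source.uniform(0.0, ceiling)      # full jitter: anywhere in [0, ceiling]


if __name__ == "__main__":
    seeded = random.Random(42)
    for attempt in range(6):
        print(attempt, round(full_jitter_delay(attempt, rng=seeded), 3))
```

With the defaults, the upper bound of the delay doubles from 1 second until it reaches the 30-second cap at the sixth attempt (attempt index 5).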

R-2: Maximum Retry Count

All retry implementations MUST enforce a maximum retry count.

Context                | Default | Allowed Range
Synchronous API calls  | 3       | 1-5
Async event processing | 5       | 1-10
Webhook delivery       | 5       | 3-8
Batch job items        | 3       | 1-5
gRPC unary calls       | 3       | 1-5

Exceeding the upper bound requires Integration Architecture Board approval.
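
One way to enforce these limits in configuration code is sketched below. The context keys (`sync_api`, `webhook`, and so on) and the function name `resolve_max_retries` are hypothetical. The numbers are taken from the table above.

```python
from typing import Optional

# context -> (default, minimum, maximum), copied from the R-2 table
RETRY_LIMITS = {
    "sync_api": (3, 1, 5),
    "async_event": (5, 1, 10),
    "webhook": (5, 3, 8),
    "batch_item": (3, 1, 5),
    "grpc_unary": (3, 1, 5),
}


def resolve_max_retries(context: str, requested: Optional[int] = None) -> int:
    """Return the retry count to use, rejecting values outside the allowed range."""
    default, low, high = RETRY_LIMITS[context]
    if requested is None:
        return default
    if not low <= requested <= high:
        # Values above the upper bound need Integration Architecture Board approval.
        raise ValueError(
            f"{context}: max retries {requested} is outside the allowed range {low}-{high}"
        )
    return requested
```

A call such as `resolve_max_retries("webhook", 8)` is accepted, while `resolve_max_retries("webhook", 9)` raises, which makes an out-of-range value visible at startup rather than in production.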

R-3: Total Retry Duration

All retry loops MUST enforce a total duration cap regardless of retry count:

  • Synchronous API calls: MUST NOT exceed 30 seconds
  • Async events / webhooks: MUST NOT exceed 24 hours

Webhook retry schedule SHOULD use increasing intervals:

Attempt | 1  | 2  | 3  | 4   | 5  | 6   | 7  | 8
Delay   | 0s | 1s | 5s | 30s | 2m | 15m | 1h | 4h

After the final attempt, the message MUST be routed to a DLQ and an alert MUST fire.
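
The webhook schedule and the 24-hour cap can be combined as in the sketch below. The names `WEBHOOK_DELAYS` and `next_webhook_delay` are hypothetical, and attempts are assumed to be numbered from 1 as in the schedule. A return value of `None` stands for "stop retrying, route to the DLQ and alert".

```python
from datetime import timedelta
from typing import Optional

# Delay before each attempt, indexed from attempt 1, as in the schedule above.
WEBHOOK_DELAYS = [
    timedelta(seconds=0),
    timedelta(seconds=1),
    timedelta(seconds=5),
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=4),
]
MAX_ASYNC_DURATION = timedelta(hours=24)  # R-3 cap for async events and webhooks


def next_webhook_delay(attempt: int, elapsed: timedelta) -> Optional[timedelta]:
    """Return the wait before the given 1-based attempt, or None to dead-letter."""
    if attempt > len(WEBHOOK_DELAYS):
        return None  # schedule exhausted
    delay = WEBHOOK_DELAYS[attempt - 1]
    if elapsed + delay > MAX_ASYNC_DURATION:
        return None  # total duration cap would be exceeded
    return delay


if __name__ == "__main__":
    # The scheduled delays sum to 19,056 seconds, well inside the 24-hour cap.
    print(sum(WEBHOOK_DELAYS, timedelta()))
```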

R-4: Retryable vs Non-Retryable Classification

Implementations MUST classify failures before deciding whether to retry.

Retryable (MUST retry with backoff):

HTTP                      | gRPC                   | Network
408 Request Timeout       | UNAVAILABLE (14)       | Connection refused
429 Too Many Requests     | DEADLINE_EXCEEDED (4)  | Connection reset
500 Internal Server Error | RESOURCE_EXHAUSTED (8) | Socket timeout
502 Bad Gateway           | ABORTED (10)           | DNS failure (max 2 attempts)
503 Service Unavailable   |                        | TLS handshake timeout
504 Gateway Timeout       |                        |

Non-retryable (MUST NOT retry):

HTTP                     | gRPC                   | Other
400 Bad Request          | INVALID_ARGUMENT (3)   | TLS certificate error
401 Unauthorized         | NOT_FOUND (5)          | Serialization error
403 Forbidden            | PERMISSION_DENIED (7)  | Invalid configuration
404 Not Found            | UNAUTHENTICATED (16)   | Token permanently revoked
409 Conflict             | UNIMPLEMENTED (12)     |
422 Unprocessable Entity |                        |
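
A classifier for the HTTP and gRPC columns might look like the sketch below. The function names are hypothetical, and network-level failures are left out for brevity. Treating codes that appear in neither table as non-retryable is an assumption made here, not something this standard specifies.

```python
# Status codes copied from the "Retryable" table in R-4.
RETRYABLE_HTTP = frozenset({408, 429, 500, 502, 503, 504})
# UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED
RETRYABLE_GRPC = frozenset({14, 4, 8, 10})


def is_retryable_http(status: int) -> bool:
    """True only for HTTP statuses explicitly listed as retryable."""
    # Assumption: anything not listed (for example 418) is not retried.
    return status in RETRYABLE_HTTP


def is_retryable_grpc(code: int) -> bool:
    """True only for numeric gRPC status codes explicitly listed as retryable."""
    return code in RETRYABLE_GRPC


if __name__ == "__main__":
    for status in (429, 503, 404, 409):
        print(status, is_retryable_http(status))
```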

R-5: Retry-After Compliance

When a response includes a Retry-After header (RFC 9110 Section 10.2.3):

  1. The client MUST wait at least the specified duration
  2. Retry-After MUST take precedence over calculated backoff when greater
  3. If Retry-After exceeds remaining retry budget, the client MUST fail immediately
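
The three rules above can be sketched as follows. Retry-After may carry either a number of seconds or an HTTP date, so both forms are parsed. The function names are hypothetical, and "remaining retry budget" is interpreted here as the time left under the R-3 duration cap, which is an assumption. `resolve_delay` returns `None` to mean "fail immediately".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: datetime) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())


def resolve_delay(backoff: float, retry_after: Optional[float],
                  remaining: float) -> Optional[float]:
    """Pick the wait before the next attempt; None means fail immediately."""
    if retry_after is None:
        return min(backoff, remaining)
    if retry_after > remaining:
        return None  # rule 3: the server asked for more time than is left
    # Rules 1 and 2: wait at least Retry-After, and use it when it is greater.
    return min(max(backoff, retry_after), remaining)
```

`now` is expected to be a timezone-aware datetime, for example `datetime.now(timezone.utc)`.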

R-6: Idempotency Requirements

Retries MUST be safe. A retry is safe only when the operation is idempotent or protected by an idempotency mechanism.

Safe to retry without additional measures: GET, HEAD, OPTIONS, PUT, DELETE.

MUST use Idempotency-Key header for retries: POST, PATCH.

Idempotency key rules:

  • Client MUST generate a UUID v4 per logical operation
  • Same key MUST be reused across all retry attempts
  • Servers MUST store keys for minimum 24 hours and return cached responses for duplicates
  • Keys MUST be at most 64 characters

Event consumers MUST implement idempotent processing using a deduplication store keyed on event ID (minimum 7-day window).
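
The key rules can be illustrated with the sketch below: a client-side key generator and a server-side cache that replays the stored response for a duplicate key. The class name `IdempotencyCache` and its in-memory dictionary are assumptions made for illustration. A real service would need a shared, durable store and protection against concurrent requests with the same key.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple

KEY_TTL_SECONDS = 24 * 60 * 60  # servers keep keys for at least 24 hours
MAX_KEY_LENGTH = 64


def new_idempotency_key() -> str:
    """Generate one UUID v4 per logical operation; reuse it on every retry."""
    return str(uuid.uuid4())


class IdempotencyCache:
    """In-memory illustration of server-side duplicate detection."""

    def __init__(self, ttl: float = KEY_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def execute(self, key: str, handler: Callable[[], Any]) -> Any:
        if len(key) > MAX_KEY_LENGTH:
            raise ValueError("idempotency key exceeds 64 characters")
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # duplicate: replay the cached response
        response = handler()  # first time this key is seen: run the side effect
        self._entries[key] = (now, response)
        return response
```

A consumer-side deduplication store keyed on event ID could follow the same shape with a 7-day TTL.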

R-7: Retry Budget

All services MUST enforce a retry budget to prevent amplification:

  • Maximum 20% of total requests MAY be retries over a rolling 30-second window
  • When budget is exhausted, retries MUST be suppressed and original error returned
  • Budget MUST be tracked per downstream dependency
  • Budget exhaustion SHOULD trigger circuit breaker open (INTG-STD-033)
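
A rolling-window budget along these lines is sketched below. The class name `RetryBudget` and its methods are hypothetical. The 20% figure is interpreted as retries divided by all requests (originals plus retries) seen in the window, and the clock is injectable so the behaviour can be tested.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RetryBudget:
    """Allow a retry only while retries stay within `ratio` of requests in the window."""

    def __init__(self, ratio: float = 0.2, window: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ratio = ratio
        self._window = window
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_retry)
        self._retries = 0

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the rolling window.
        while self._events and now - self._events[0][0] > self._window:
            _, was_retry = self._events.popleft()
            if was_retry:
                self._retries -= 1

    def record_request(self) -> None:
        """Call once for every original (non-retry) request."""
        now = self._clock()
        self._evict(now)
        self._events.append((now, False))

    def may_retry(self) -> bool:
        """Reserve a retry if the budget allows it; otherwise return False."""
        now = self._clock()
        self._evict(now)
        if self._retries + 1 > self._ratio * (len(self._events) + 1):
            return False  # suppress the retry and surface the original error
        self._events.append((now, True))
        self._retries += 1
        return True
```

To track the budget per downstream dependency, one instance can be kept per dependency name, for example in a `collections.defaultdict(RetryBudget)`.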

R-8: Dead-Letter Queue Routing

For async integrations (events, webhooks, queues):

  1. After retry exhaustion, messages MUST be routed to a DLQ
  2. DLQ messages MUST retain original payload, headers, and retry metadata
  3. DLQ messages MUST trigger an alert
  4. Messages MUST NOT be silently dropped
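
The routing steps above might be wired together as in this sketch. The function `route_to_dlq`, the envelope layout, and the `publish` and `alert` callables are all hypothetical stand-ins for whatever broker client and alerting hook a service actually uses.

```python
from datetime import datetime, timezone
from typing import Callable, Mapping


def route_to_dlq(payload: bytes, headers: Mapping[str, str], attempts: int,
                 last_error: str, publish: Callable[[dict], None],
                 alert: Callable[[str], None]) -> dict:
    """Wrap an exhausted message with retry metadata, publish it, and alert."""
    envelope = {
        "payload": payload,          # original payload, unmodified
        "headers": dict(headers),    # original headers, copied
        "retry": {
            "attempts": attempts,
            "last_error": last_error,
            "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    publish(envelope)  # never drop silently: the message always goes to the DLQ
    alert(f"message dead-lettered after {attempts} attempts: {last_error}")
    return envelope
```

Per R-9, the alert text carries only the attempt count and error description, not the payload.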

R-9: Security

Retry logic MUST NOT introduce security vulnerabilities:

  • Retries MUST use the original auth context. If a token has expired, the client MUST refresh it before retrying and MUST NOT degrade to a lower-security credential or skip authentication.
  • Retry logs MUST NOT include request bodies, tokens, or PII.
  • TLS certificate errors MUST NOT be retried. A certificate error indicates a potential MITM (Man-in-the-Middle) attack — an adversary intercepting traffic and presenting a fraudulent certificate. Retrying would re-expose the request to the same compromised connection. The error MUST be surfaced immediately for investigation.
  • Retry budget (R-7) is mandatory to prevent DDoS amplification — uncontrolled retries from many clients simultaneously can amplify traffic to a degraded service, preventing its recovery.

R-10: Observability

Every retry attempt MUST be logged with: correlation_id, dependency, attempt, max_attempts, backoff_ms, error_type, idempotency_key.
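
A log record with exactly these fields could be built as in the sketch below. The function name `retry_log_record` and the JSON output format are assumptions. Emitting only an allow-list of fields also keeps request bodies and tokens out of the log, in line with R-9.

```python
import json

REQUIRED_FIELDS = (
    "correlation_id", "dependency", "attempt", "max_attempts",
    "backoff_ms", "error_type", "idempotency_key",
)


def retry_log_record(**fields) -> str:
    """Return a JSON log line containing only the fields required by R-10."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing retry log fields: {', '.join(missing)}")
    # Allow-list: any extra keyword (body, token, PII) is deliberately dropped.
    return json.dumps({name: fields[name] for name in REQUIRED_FIELDS}, sort_keys=True)


if __name__ == "__main__":
    print(retry_log_record(
        correlation_id="c-123", dependency="payments", attempt=2, max_attempts=3,
        backoff_ms=1840, error_type="HTTP_503", idempotency_key=None,
        body="this field is dropped",
    ))
```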

Required metrics:

Metric                         | Type
retry_attempts_total           | Counter (by service, dependency, attempt_number)
retry_exhausted_total          | Counter (by service, dependency)
retry_backoff_duration_seconds | Histogram
retry_budget_utilization_ratio | Gauge
dlq_messages_total             | Counter (by queue)

Examples

Retry with full jitter

function retry(operation, budget, max_retries=3, base=1.0, cap=30.0, max_duration=30.0):
    deadline = now() + max_duration          # total duration cap (R-3); 30 s is the synchronous limit

    for attempt in 0..max_retries:           # inclusive: the initial call plus max_retries retries
        result = operation()
        if result.success:
            return result

        if not is_retryable(result.error):   # classify before retrying (R-4)
            fail(result.error)

        remaining = deadline - now()
        if attempt == max_retries or remaining <= 0:
            fail("retries exhausted", attempts=attempt+1)
        if not budget.may_retry():           # retry budget (R-7)
            fail("retry budget exhausted")

        # Full jitter (R-1), clamped to the time left before the deadline
        delay = min(random(0, min(cap, base * 2^attempt)), remaining)
        log(attempt=attempt+1, backoff_ms=delay*1000, error=result.error)
        wait(delay)

Idempotent retry on a non-idempotent operation

First attempt:

POST /v1/payments
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2

{"amount": 100.00, "currency": "USD"}

Retry (same key - server returns cached response, no duplicate side effect):

POST /v1/payments
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2

{"amount": 100.00, "currency": "USD"}

Enforcement Rules

Rule                                     | Gate                 | Action
Fixed-interval retry detected            | CI lint              | Block merge
POST/PATCH retry without idempotency key | CI lint              | Block merge
Retry on non-retryable status code       | CI lint              | Block merge
Missing retry budget                     | Production readiness | Block deploy
Max retries exceeds allowed range        | Architecture review  | IAB approval
DLQ not configured for async consumers   | Production readiness | Block deploy

References

Rationale

Full jitter over alternatives: AWS research shows full jitter produces the least total work across competing clients. Equal jitter's guaranteed minimum floor creates clustering that partially defeats the purpose of jitter.

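20% retry budget: Google's gRPC default. Without a budget, a service receiving 1,000 req/s with a 50% failure rate and 3 retries per failed request can amplify to 2,500 req/s (1,000 original requests plus 500 failures retried 3 times each, assuming every retry also fails). With a 20% budget, load stays at about 1,200 req/s, which leaves manageable headroom for recovery.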
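
The arithmetic behind these figures can be reproduced with the sketch below. Two assumptions are made to match the numbers quoted: every retry of a failing request also fails (worst case), and the 1,200 req/s figure counts the budget as 20% of original requests. If the 20% is instead measured against total traffic, as R-7 words it, the ceiling works out to 1,250 req/s. The function names are hypothetical and integer percentages are used to keep the results exact.

```python
def unbudgeted_load(original_rps: int, failure_percent: int, max_retries: int) -> int:
    """Worst-case load when every failed request is retried max_retries times."""
    failed = original_rps * failure_percent // 100
    return original_rps + failed * max_retries


def budgeted_load(original_rps: int, budget_percent: int) -> int:
    """Load ceiling when retries are limited to budget_percent of original requests."""
    return original_rps * (100 + budget_percent) // 100


def budgeted_load_of_total(original_rps: int, budget_percent: int) -> int:
    """Load ceiling when retries are limited to budget_percent of total requests."""
    return original_rps * 100 // (100 - budget_percent)


if __name__ == "__main__":
    print(unbudgeted_load(1000, 50, 3))      # 2500
    print(budgeted_load(1000, 20))           # 1200
    print(budgeted_load_of_total(1000, 20))  # 1250
```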

Idempotency keys: Stripe's pattern makes retries safe at the protocol level without requiring retry logic to understand business semantics.

DLQ over infinite retry: Infinite retry causes unbounded queue growth, head-of-line blocking, and masks bugs. DLQ cleanly separates transient failures from problems needing human attention.

Version History

Version | Date       | Change
1.0.0   | 2026-03-28 | Initial definition
1.1.0   | 2026-04-10 | R-9: defined MITM and DDoS amplification; converted numbered list to bullet format with inline rationale