INTG-STD-034 | v1.1.0 | MANDATORY | INTEGRATION | standard

Retry Policy

Purpose

Uncontrolled retries are a leading cause of cascading failures. When clients retry simultaneously with correlated timing, the resulting "thundering herd" overwhelms recovering services. This standard mandates exponential backoff with jitter, retry budgets, and idempotency requirements across all integration boundaries.

Companion standards: INTG-STD-033 (Resilience), INTG-STD-035 (Timeout).

Normative language follows RFC 2119 semantics.

Rules

R-1: Exponential Backoff with Full Jitter

All retry implementations MUST use exponential backoff with full jitter as the default algorithm:

sleep = random_between(0, min(cap, base * 2 ^ attempt))

base MUST default to 1 second
cap MUST default to 30 seconds
Decorrelated jitter MAY be used as an alternative
Equal jitter and fixed-interval retry MUST NOT be used
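
The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `full_jitter_delay`, the `rng` parameter, and the exponent clamp are assumptions added for the example, and `attempt` is assumed to be zero-based.

```python
import random
from typing import Optional

_DEFAULT_RNG = random.Random()


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Return a sleep time in seconds drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    rng = rng if rng is not None else _DEFAULT_RNG
    # Clamp the exponent so very large attempt numbers cannot overflow a float
    # (an assumption of this sketch; the cap dominates long before this point).
    upper = min(cap, base * 2 ** min(attempt, 62))
    return rng.uniform(0.0, upper)
```

With the defaults, the upper bound grows 1s, 2s, 4s, 8s, 16s and then stays at the 30-second cap, while the actual sleep is anywhere between zero and that bound.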

R-2: Maximum Retry Count

All retry implementations MUST enforce a maximum retry count.

Context	Default	Allowed Range
Synchronous API calls	3	1-5
Async event processing	5	1-10
Webhook delivery	5	3-8
Batch job items	3	1-5
gRPC unary calls	3	1-5

Exceeding the upper bound requires Integration Architecture Board approval.
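
One way to enforce the table above in client configuration is a lookup with range validation. The context keys (`sync_api`, `webhook`, and so on) and the function name `resolve_max_retries` are hypothetical names chosen for this sketch.

```python
# (default, minimum, maximum) retries per context, copied from the R-2 table.
RETRY_LIMITS = {
    "sync_api": (3, 1, 5),
    "async_event": (5, 1, 10),
    "webhook": (5, 3, 8),
    "batch_item": (3, 1, 5),
    "grpc_unary": (3, 1, 5),
}


def resolve_max_retries(context, requested=None):
    """Return the retry count for a context, rejecting values outside the allowed range."""
    default, low, high = RETRY_LIMITS[context]
    if requested is None:
        return default
    if not low <= requested <= high:
        # Values above the upper bound need board approval, so fail loudly here.
        raise ValueError(f"{context}: {requested} is outside the allowed range {low}-{high}")
    return requested
```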

R-3: Total Retry Duration

All retry loops MUST enforce a total duration cap regardless of retry count:

Synchronous API calls: MUST NOT exceed 30 seconds
Async events / webhooks: MUST NOT exceed 24 hours

Webhook retry schedule SHOULD use increasing intervals:

Attempt	1	2	3	4	5	6	7	8
Delay	0s	1s	5s	30s	2m	15m	1h	4h

After the final attempt, the message MUST be routed to a DLQ and an alert MUST fire.
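
As a quick check that the suggested webhook schedule fits inside the 24-hour cap, the delays can be summed. The sketch assumes each delay is the wait before the corresponding attempt; the names `WEBHOOK_DELAYS` and `schedule_offsets` are illustrative.

```python
from datetime import timedelta
from itertools import accumulate

# Delay before each of the eight attempts, from the schedule table.
WEBHOOK_DELAYS = [
    timedelta(seconds=0), timedelta(seconds=1), timedelta(seconds=5),
    timedelta(seconds=30), timedelta(minutes=2), timedelta(minutes=15),
    timedelta(hours=1), timedelta(hours=4),
]
MAX_ASYNC_DURATION = timedelta(hours=24)


def schedule_offsets(delays):
    """Return the elapsed time at which each attempt fires, measured from the first."""
    return list(accumulate(delays))


offsets = schedule_offsets(WEBHOOK_DELAYS)
total = offsets[-1]
assert total <= MAX_ASYNC_DURATION
print(total)  # 5:17:36
```

The final attempt fires a little over five hours after the first, well within the 24-hour limit.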

R-4: Retryable vs Non-Retryable Classification

Implementations MUST classify failures before deciding whether to retry.

Retryable (MUST retry with backoff):

HTTP	gRPC	Network
408 Request Timeout	UNAVAILABLE (14)	Connection refused
429 Too Many Requests	DEADLINE_EXCEEDED (4)	Connection reset
500 Internal Server Error	RESOURCE_EXHAUSTED (8)	Socket timeout
502 Bad Gateway	ABORTED (10)	DNS failure (max 2 attempts)
503 Service Unavailable		TLS handshake timeout
504 Gateway Timeout

Non-retryable (MUST NOT retry):

HTTP	gRPC	Other
400 Bad Request	INVALID_ARGUMENT (3)	TLS certificate error
401 Unauthorized	NOT_FOUND (5)	Serialization error
403 Forbidden	PERMISSION_DENIED (7)	Invalid configuration
404 Not Found	UNAUTHENTICATED (16)	Token permanently revoked
409 Conflict	UNIMPLEMENTED (12)
422 Unprocessable Entity
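
The HTTP and gRPC columns of both tables can be captured as lookup sets. This is a sketch: the function names are hypothetical, gRPC codes are given as plain integers to avoid a library dependency, and treating unlisted codes as non-retryable is an assumption the standard does not state.

```python
RETRYABLE_HTTP = frozenset({408, 429, 500, 502, 503, 504})
NON_RETRYABLE_HTTP = frozenset({400, 401, 403, 404, 409, 422})

# gRPC status codes by number: DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED, UNAVAILABLE.
RETRYABLE_GRPC = frozenset({4, 8, 10, 14})
# INVALID_ARGUMENT, NOT_FOUND, PERMISSION_DENIED, UNIMPLEMENTED, UNAUTHENTICATED.
NON_RETRYABLE_GRPC = frozenset({3, 5, 7, 12, 16})


def is_retryable_http(status: int) -> bool:
    """True only for statuses the standard lists as retryable; unlisted codes are not retried."""
    return status in RETRYABLE_HTTP


def is_retryable_grpc(code: int) -> bool:
    """True only for gRPC codes the standard lists as retryable."""
    return code in RETRYABLE_GRPC
```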

R-5: Retry-After Compliance

When a response includes a Retry-After header (RFC 9110 Section 10.2.3):

The client MUST wait at least the specified duration
Retry-After MUST take precedence over calculated backoff when greater
If Retry-After exceeds remaining retry budget, the client MUST fail immediately
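
A sketch of how a client might apply these three rules. Retry-After may carry either a number of seconds or an HTTP date, so both are parsed. The function names and the convention of returning `None` to mean "fail immediately" are assumptions of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the Retry-After wait in seconds, or None if the header is unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(backoff: float, retry_after: Optional[float], remaining: float) -> Optional[float]:
    """Pick the wait before the next attempt; None means stop retrying now."""
    if retry_after is None:
        return backoff
    if retry_after > remaining:
        return None  # the server asked for more time than the retry budget allows
    return max(backoff, retry_after)  # Retry-After wins when it is the larger value
```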

R-6: Idempotency Requirements

Retries MUST be safe. A retry is safe only when the operation is idempotent or protected by an idempotency mechanism.

Safe to retry without additional measures: GET, HEAD, OPTIONS, PUT, DELETE.

MUST use Idempotency-Key header for retries: POST, PATCH.

Idempotency key rules:

Client MUST generate a UUID v4 per logical operation
Same key MUST be reused across all retry attempts
Servers MUST store keys for minimum 24 hours and return cached responses for duplicates
Keys MUST be at most 64 characters

Event consumers MUST implement idempotent processing using a deduplication store keyed on event ID (minimum 7-day window).
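
The key rules can be illustrated with a small in-memory sketch. A production server would use a shared, durable store; `IdempotencyStore`, `new_idempotency_key`, and the `handler` callback are hypothetical names used only for this example.

```python
import time
import uuid

MAX_KEY_LENGTH = 64
KEY_TTL_SECONDS = 24 * 60 * 60  # minimum retention required for stored keys


def new_idempotency_key() -> str:
    """Client side: generate one UUID v4 per logical operation and reuse it on every retry."""
    return str(uuid.uuid4())


class IdempotencyStore:
    """Server side: run the handler once per key and replay the cached response for duplicates."""

    def __init__(self, ttl_seconds=KEY_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._responses = {}  # key -> (expires_at, response)

    def execute(self, key, handler):
        if not key or len(key) > MAX_KEY_LENGTH:
            raise ValueError("idempotency key must be 1-64 characters")
        now = self._clock()
        cached = self._responses.get(key)
        if cached is not None and now < cached[0]:
            return cached[1]  # duplicate: no second side effect
        response = handler()
        self._responses[key] = (now + self._ttl, response)
        return response
```

The same pattern applies to event consumers, with the event ID as the key and a retention window of at least seven days.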

R-7: Retry Budget

All services MUST enforce a retry budget to prevent amplification:

Retries MUST NOT exceed 20% of total requests over a rolling 30-second window
When the budget is exhausted, retries MUST be suppressed and the original error returned
Budget MUST be tracked per downstream dependency
Budget exhaustion SHOULD trigger circuit breaker open (INTG-STD-033)
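
A rolling-window budget along these lines could look like the sketch below. It is one possible design, not a prescribed one: the class name, the `record_request` method, and the choice to count the prospective retry in both the numerator and the denominator are assumptions. The `may_retry` method matches the name used in the retry example later in this standard.

```python
import time
from collections import deque


class RetryBudget:
    """Allow retries only while they stay within `ratio` of all requests in the window."""

    def __init__(self, ratio=0.2, window_seconds=30.0, clock=time.monotonic):
        self._ratio = ratio
        self._window = window_seconds
        self._clock = clock
        self._events = deque()  # (timestamp, is_retry), oldest first
        self._retries = 0

    def _evict(self, now):
        # Drop events that have fallen out of the rolling window.
        while self._events and self._events[0][0] <= now - self._window:
            _, was_retry = self._events.popleft()
            if was_retry:
                self._retries -= 1

    def record_request(self):
        """Call once for every original (non-retry) request."""
        now = self._clock()
        self._evict(now)
        self._events.append((now, False))

    def may_retry(self):
        """Return True and record the retry if it fits the budget, else False."""
        now = self._clock()
        self._evict(now)
        if self._retries + 1 > self._ratio * (len(self._events) + 1):
            return False
        self._events.append((now, True))
        self._retries += 1
        return True
```

To track the budget per downstream dependency, a service could keep one instance per dependency, for example in a `collections.defaultdict(RetryBudget)` keyed by dependency name.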

R-8: Dead-Letter Queue Routing

For async integrations (events, webhooks, queues):

After retry exhaustion, messages MUST be routed to a DLQ
DLQ messages MUST retain original payload, headers, and retry metadata
DLQ messages MUST trigger an alert
Messages MUST NOT be silently dropped
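
A minimal sketch of a DLQ envelope that keeps the original payload, headers, and retry metadata together. The envelope layout and the `publish` and `alert` callbacks are hypothetical; a real integration would use its broker's and alerting system's own clients.

```python
from datetime import datetime, timezone


def build_dead_letter(payload, headers, attempts, last_error, now=None):
    """Wrap a failed message without altering its original payload or headers."""
    now = now or datetime.now(timezone.utc)
    return {
        "payload": payload,
        "headers": dict(headers),
        "retry": {
            "attempts": attempts,
            "last_error": last_error,
            "dead_lettered_at": now.isoformat(),
        },
    }


def route_to_dlq(envelope, publish, alert):
    """Publish to the DLQ first, then alert, so a message is never silently dropped."""
    publish(envelope)
    alert(f"message dead-lettered after {envelope['retry']['attempts']} attempts: "
          f"{envelope['retry']['last_error']}")
```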

R-9: Security

Retry logic MUST NOT introduce security vulnerabilities:

Retries MUST use the original auth context; if a token has expired, MUST refresh it before retrying — MUST NOT degrade to a lower-security credential or skip authentication.
Retry logs MUST NOT include request bodies, tokens, or PII.
TLS certificate errors MUST NOT be retried. A certificate error indicates a potential MITM (Man-in-the-Middle) attack — an adversary intercepting traffic and presenting a fraudulent certificate. Retrying would re-expose the request to the same compromised connection. The error MUST be surfaced immediately for investigation.
Retry budget (R-7) is mandatory to prevent DDoS amplification — uncontrolled retries from many clients simultaneously can amplify traffic to a degraded service, preventing its recovery.

R-10: Observability

Every retry attempt MUST be logged with: correlation_id, dependency, attempt, max_attempts, backoff_ms, error_type, idempotency_key.

Required metrics:

Metric	Type
`retry_attempts_total`	Counter (by service, dependency, attempt_number)
`retry_exhausted_total`	Counter (by service, dependency)
`retry_backoff_duration_seconds`	Histogram
`retry_budget_utilization_ratio`	Gauge
`dlq_messages_total`	Counter (by queue)
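
The required log fields can be enforced with a small helper such as the sketch below. The name `retry_log_line` and the JSON output format are assumptions; rejecting unexpected fields is one way to keep request bodies, tokens, and PII out of retry logs, as R-9 requires.

```python
import json

REQUIRED_FIELDS = (
    "correlation_id", "dependency", "attempt", "max_attempts",
    "backoff_ms", "error_type", "idempotency_key",
)


def retry_log_line(**fields):
    """Return one JSON log line containing exactly the fields required for a retry attempt."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    unexpected = sorted(set(fields) - set(REQUIRED_FIELDS))
    if missing or unexpected:
        raise ValueError(f"missing={missing} unexpected={unexpected}")
    return json.dumps(fields, sort_keys=True)
```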

Examples

Retry with full jitter

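import logging
import random
import time

log = logging.getLogger("retry")


def retry(operation, budget, is_retryable, max_retries=3, base=1.0, cap=30.0, max_duration=30.0):
    """Run operation() with full-jitter backoff; raise RuntimeError when giving up."""
    deadline = time.monotonic() + max_duration  # R-3: cap on total retry time

    for attempt in range(max_retries + 1):  # initial call plus up to max_retries retries
        result = operation()
        if result.success:
            return result

        if not is_retryable(result.error):  # R-4: never retry permanent failures
            raise RuntimeError(f"non-retryable error: {result.error}")

        remaining = deadline - time.monotonic()
        if attempt == max_retries or remaining <= 0:
            raise RuntimeError(f"retries exhausted after {attempt + 1} attempts")
        if not budget.may_retry():  # R-7: suppress retries once the budget is spent
            raise RuntimeError("retry budget exhausted")

        # R-1: full jitter, clamped so the sleep cannot overrun the deadline
        delay = min(random.uniform(0, min(cap, base * 2 ** attempt)), remaining)
        log.warning("retry attempt=%d backoff_ms=%d error=%s", attempt + 1, delay * 1000, result.error)
        time.sleep(delay)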

Idempotent retry on a non-idempotent operation

First attempt:

POST /v1/payments
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2

{"amount": 100.00, "currency": "USD"}

Retry (same key - server returns cached response, no duplicate side effect):

POST /v1/payments
Idempotency-Key: 7c4a8d09-ca95-4c6d-8f3b-91a7e6e0b9d2

{"amount": 100.00, "currency": "USD"}

Enforcement Rules

Rule	Gate	Action
Fixed-interval retry detected	CI lint	Block merge
POST/PATCH retry without idempotency key	CI lint	Block merge
Retry on non-retryable status code	CI lint	Block merge
Missing retry budget	Production readiness	Block deploy
Max retries exceeds allowed range	Architecture review	IAB approval
DLQ not configured for async consumers	Production readiness	Block deploy

References

Rationale

Full jitter over alternatives: AWS research shows full jitter produces the least total work across competing clients. Equal jitter's guaranteed minimum floor creates clustering that partially defeats the purpose of jitter.

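20% retry budget: Google's gRPC default. Without a budget, a service receiving 1,000 req/s with a 50% failure rate and 3 retries per failed request amplifies to 2,500 req/s (1,000 original requests plus 500 failures x 3 retries). With retries capped at 20%, load stays at roughly 1,200 req/s - manageable headroom for recovery.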
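
The arithmetic behind these figures can be reproduced as follows. The function names are illustrative, and the sketch assumes every failed request is retried the full number of times and that the budget is measured against original requests.

```python
def load_without_budget(original_rps, failure_rate, retries):
    """Total load when every failed request is retried `retries` times."""
    return original_rps * (1 + failure_rate * retries)


def load_with_budget(original_rps, budget_ratio):
    """Upper bound on load when retries are capped at a fraction of original requests."""
    return original_rps * (1 + budget_ratio)


print(f"{load_without_budget(1000, 0.5, 3):.0f} req/s")  # 2500 req/s
print(f"{load_with_budget(1000, 0.2):.0f} req/s")        # 1200 req/s
```

If the 20% is instead measured against total traffic including retries, as R-7 words it, the ceiling is 1,000 / 0.8 = 1,250 req/s, which is the same order of magnitude.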

Idempotency keys: Stripe's pattern makes retries safe at the protocol level without requiring retry logic to understand business semantics.

DLQ over infinite retry: Infinite retry causes unbounded queue growth, head-of-line blocking, and masks bugs. DLQ cleanly separates transient failures from problems needing human attention.

Version History

Version	Date	Change
1.0.0	2026-03-28	Initial definition
1.1.0	2026-04-10	R-9: defined MITM and DDoS amplification; converted numbered list to bullet format with inline rationale
