INTG-STD-005 | v1.0.0 | MANDATORY | INTEGRATION | standard

Character Encoding

Purpose

This standard defines the REQUIRED character encoding for all text data exchanged across integration boundaries. Inconsistent encoding causes data corruption, security vulnerabilities, and integration failures. All integration surfaces MUST use UTF-8 with NFC normalization to guarantee round-trip fidelity and deterministic string comparison.

Normative language (MUST, MUST NOT, SHOULD, SHOULD NOT, MAY) follows RFC 2119 semantics.


Rules

R-1: UTF-8 as Sole Encoding

All text data crossing integration boundaries MUST be encoded in UTF-8 (RFC 3629).

  • UTF-16, UTF-32, ISO-8859-x, Windows-125x, Shift_JIS, and all other encodings MUST NOT be used at integration boundaries.
  • Internal representations MAY use other encodings, but MUST convert to UTF-8 before crossing any boundary.
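As a minimal sketch of the conversion step in the second bullet, assuming the internal encoding is known in advance, the following Python helper transcodes to UTF-8 before data leaves the system. The name `to_utf8` is hypothetical and is not defined by this standard.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known internal encoding to UTF-8.

    Hypothetical helper for illustration. Strict decoding is used so that
    bytes which are invalid in the source encoding raise an error instead
    of being silently altered.
    """
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Internal Latin-1 and UTF-16 data converted before crossing a boundary.
latin1_internal = "caf\u00e9".encode("latin-1")
utf16_internal = "\u65e5\u672c\u8a9e".encode("utf-16-le")

print(to_utf8(latin1_internal, "latin-1"))
print(to_utf8(utf16_internal, "utf-16-le"))
```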

R-2: Valid UTF-8 Sequences Only

All byte sequences MUST be well-formed per RFC 3629. The following MUST be rejected:

  • Overlong encodings (e.g., C0 80 for U+0000)
  • Surrogate halves (U+D800 through U+DFFF)
  • Code points beyond U+10FFFF
  • Truncated multi-byte sequences
  • Unexpected continuation bytes without a valid leading byte
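The rejection cases above can be checked with a strict decoder. The sketch below assumes Python 3, whose built-in UTF-8 codec rejects each of the listed cases in strict mode; `is_well_formed_utf8` is a hypothetical name chosen for illustration.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# One sample per rejection case listed in R-2.
MALFORMED_SAMPLES = {
    "overlong encoding of U+0000": b"\xc0\x80",
    "surrogate half U+D800": b"\xed\xa0\x80",
    "code point beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "truncated multi-byte sequence": b"\xe2\x82",
    "stray continuation byte": b"\x80",
}

for label, raw in MALFORMED_SAMPLES.items():
    print(f"{label}: well_formed={is_well_formed_utf8(raw)}")
```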

R-3: NFC Normalization

All text MUST be normalized to NFC (Unicode Normalization Form C, defined in UAX #15 and recommended for network interchange by RFC 5198) before transmission.

  • Producers MUST emit NFC-normalized text.
  • Consumers SHOULD verify NFC normalization on input.
  • String comparison at integration boundaries MUST compare NFC-normalized forms.
  • NFKC or NFKD MUST NOT be applied at integration boundaries - compatibility decomposition is lossy.
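A minimal sketch of these rules using Python's standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). The helper names `to_nfc`, `is_nfc`, and `boundary_equal` are hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Producer side: normalize to NFC before emitting."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Consumer side: verify that input is already NFC."""
    return unicodedata.is_normalized("NFC", text)


def boundary_equal(left: str, right: str) -> bool:
    """Compare two strings by their NFC-normalized forms."""
    return to_nfc(left) == to_nfc(right)


decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE

print(decomposed == composed)                 # raw comparison differs
print(boundary_equal(decomposed, composed))   # NFC comparison matches

# Why NFKC is excluded: it folds the ligature U+FB01 into plain "fi",
# and the original character cannot be recovered afterwards.
ligature = "\ufb01"
print(unicodedata.normalize("NFKC", ligature))
```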

R-4: BOM (Byte Order Mark) Handling

The UTF-8 BOM (EF BB BF) MUST be handled per format:

| Format | BOM Rule | Rationale |
| --- | --- | --- |
| JSON | MUST NOT be present | RFC 8259 forbids it |
| XML | MAY be present; parsers MUST tolerate | XML spec permits BOM |
| CSV | SHOULD be present for spreadsheet interop | Required for UTF-8 detection in common tools |
| Protocol Buffers / gRPC | MUST NOT be present | Binary framing; BOM is meaningless |
| GraphQL | MUST NOT be present | Spec treats BOM as insignificant |
| Plain text (logs, config) | SHOULD NOT be present | Breaks concatenation and Unix tools |
| HTTP bodies | MUST NOT be present in JSON/API responses | Redundant with Content-Type header |

If a BOM is encountered where forbidden, the system MUST strip it before processing and MAY log a warning.
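The strip-and-warn behavior, together with the CSV case where a BOM is recommended, might look like the sketch below. The function names are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs
import logging

logger = logging.getLogger(__name__)


def strip_forbidden_bom(payload: bytes) -> bytes:
    """Remove a leading UTF-8 BOM from a format that forbids it (e.g. JSON)."""
    if payload.startswith(codecs.BOM_UTF8):
        # Logging is optional per R-4 (MAY log a warning).
        logger.warning("UNEXPECTED_BOM: stripped leading EF BB BF")
        return payload[len(codecs.BOM_UTF8):]
    return payload


def csv_bytes_with_bom(csv_text: str) -> bytes:
    """Encode CSV text with a leading BOM for spreadsheet interop."""
    # The "utf-8-sig" codec prepends EF BB BF when encoding.
    return csv_text.encode("utf-8-sig")


print(strip_forbidden_bom(codecs.BOM_UTF8 + b'{"ok": true}'))
print(csv_bytes_with_bom("id,name\n1,Zo\u00eb\n")[:3])
```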

R-5: Encoding Declaration

The encoding MUST be declared through the appropriate mechanism:

| Surface | Declaration Mechanism |
| --- | --- |
| HTTP responses | Content-Type header with charset=utf-8 |
| HTTP requests | Content-Type header MUST include charset=utf-8 for text payloads |
| XML documents | <?xml version="1.0" encoding="UTF-8"?> MUST be present |
| HTML documents | <meta charset="UTF-8"> MUST appear within first 1024 bytes |
| CSV files | UTF-8 BOM as first bytes; text/csv; charset=utf-8 in HTTP |
| Event payloads | Schema registry or envelope MUST declare encoding |
| Message queues | Message metadata MUST declare charset=utf-8 |
| gRPC / Protobuf | Implicit - string type is defined as UTF-8 |
| File transfers | Filename convention or manifest MUST declare encoding |
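For the HTTP rows, a receiver could check the declared charset roughly as follows. This sketch parses the header with the standard library's `email.message.Message`; the helper names are hypothetical, and a real gateway would more likely use its own framework's header parser.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declares_utf8(content_type: str) -> bool:
    """True when the header explicitly declares charset=utf-8."""
    return declared_charset(content_type) == "utf-8"


print(declares_utf8("application/json; charset=utf-8"))
print(declares_utf8('text/csv; charset="UTF-8"'))
print(declared_charset("application/json"))  # no declaration present
```

A missing declaration corresponds to the MISSING_CHARSET case in the enforcement rules, which logs a warning and assumes UTF-8 rather than rejecting.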

R-6: Rejection Policy

Systems receiving data at integration boundaries MUST validate encoding:

  • API gateways MUST reject non-UTF-8 payloads with HTTP 400 and a descriptive error.
  • Event consumers MUST route non-UTF-8 messages to a dead-letter queue and emit an alert.
  • File processors MUST reject non-UTF-8 files and log the detected encoding.
  • Batch jobs MUST fail the individual record (not the entire batch) and report violations in the summary.

Systems MUST NOT silently replace invalid bytes with U+FFFD at integration boundaries. Replacement characters are acceptable only for internal display or logging.
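The batch rule (fail the record, not the batch) and the ban on silent U+FFFD replacement can be sketched as below. `BatchSummary` and `process_batch` are hypothetical names, and the summary layout is an assumption rather than a format defined by this standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BatchSummary:
    accepted: List[str] = field(default_factory=list)
    violations: List[Dict[str, object]] = field(default_factory=list)


def process_batch(records: List[bytes]) -> BatchSummary:
    """Decode each record strictly; collect failures instead of aborting."""
    summary = BatchSummary()
    for index, raw in enumerate(records):
        try:
            # errors="strict" raises on invalid bytes. errors="replace"
            # would insert U+FFFD silently, which R-6 forbids here.
            summary.accepted.append(raw.decode("utf-8", errors="strict"))
        except UnicodeDecodeError as exc:
            summary.violations.append(
                {"record": index, "error": "INVALID_UTF8", "byte_offset": exc.start}
            )
    return summary


result = process_batch([b"ok", b"caf\xe9", "\u65e5\u672c".encode("utf-8")])
print(result.accepted)
print(result.violations)
```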

R-7: Database Storage

Integration-facing tables MUST use UTF-8-compatible character sets and collations. Notably, the 3-byte "utf8" character set in MySQL (an alias for utf8mb3) MUST NOT be used, because it cannot represent characters outside the Basic Multilingual Plane. Use the full 4-byte UTF-8 character set (utf8mb4 in MySQL) instead.
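To illustrate which data a 3-byte character set cannot hold, the sketch below lists the characters in a string that lie outside the Basic Multilingual Plane and therefore need four bytes in UTF-8. The function name is hypothetical.

```python
from typing import List


def supplementary_chars(text: str) -> List[str]:
    """Return characters above U+FFFF, which take four bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "status: \U0001F600 \U00020000"  # an emoji and a CJK Extension B ideograph
for ch in supplementary_chars(sample):
    print(f"U+{ord(ch):05X} needs {len(ch.encode('utf-8'))} bytes")
```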

R-8: Special Character Handling

| Context | Rule |
| --- | --- |
| JSON strings | Control characters (U+0000 - U+001F) MUST be escaped per RFC 8259 |
| XML content | Syntax-conflicting characters MUST use entities or CDATA; control chars other than TAB, LF, CR MUST NOT appear |
| URL parameters | Non-ASCII characters MUST be percent-encoded after UTF-8 encoding (RFC 3986) |
| SQL | Parameterized queries MUST be used; string concatenation with user Unicode input MUST NOT be used |
| Log output | Non-printable characters SHOULD be escaped using \uXXXX notation |
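Each context above has a standard-library counterpart in Python; the sketch below shows one per row. It uses an in-memory SQLite database purely for illustration, and `escape_for_log` and `find_user` are hypothetical helpers.

```python
import json
import sqlite3
from urllib.parse import quote
from xml.sax.saxutils import escape


def escape_for_log(text: str) -> str:
    """Escape non-printable characters in \\uXXXX style.

    Code points above U+FFFF would yield more than four hex digits here.
    """
    return "".join(ch if ch.isprintable() else f"\\u{ord(ch):04x}" for ch in text)


def find_user(conn: sqlite3.Connection, name: str):
    """Look up a user with a parameterized query (no string concatenation)."""
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchone()


json_text = json.dumps({"note": "bell\x07"})   # control character is escaped
xml_text = escape("R&D <draft>")               # &, <, > become entities
url_param = quote("caf\u00e9", safe="")        # UTF-8 bytes, then percent-encoding

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
tricky_name = "Zo\u00eb'; DROP TABLE users; --"
conn.execute("INSERT INTO users (name) VALUES (?)", (tricky_name,))

print(json_text, xml_text, url_param, escape_for_log("ok\x1b[0m"), find_user(conn, tricky_name))
```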

Examples

Valid: UTF-8 JSON with Encoding Declaration

A JSON payload containing multilingual text, served with the correct Content-Type header (application/json; charset=utf-8). All characters are valid UTF-8, NFC-normalized, and no BOM is present.
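A payload of this kind could be produced as in the sketch below. The field names and greeting strings are invented for illustration; only the header value comes from the description above.

```python
import json
import unicodedata

# Hypothetical multilingual content, normalized to NFC before serialization.
greetings = {
    "en": "Hello",
    "fr": "Caf\u00e9",
    "ja": "\u3053\u3093\u306b\u3061\u306f",
    "ru": "\u041f\u0440\u0438\u0432\u0435\u0442",
}
payload = {lang: unicodedata.normalize("NFC", text) for lang, text in greetings.items()}

# ensure_ascii=False keeps non-ASCII characters as literal UTF-8 in the body.
# str.encode("utf-8") never adds a BOM.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
headers = {"Content-Type": "application/json; charset=utf-8"}

print(headers)
print(body.decode("utf-8"))
```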

Invalid: Non-UTF-8 Payload

A payload containing bare 0xE9 (Latin-1 "e-acute") without valid UTF-8 continuation bytes. Rejected at the gateway with error code INVALID_ENCODING and the byte offset of the invalid sequence.
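A gateway check that produces this rejection might be sketched as follows. The response shape (status code plus a dictionary) is an assumption; the error code and the byte offset follow the description above.

```python
from typing import Dict, Tuple


def gateway_check(body: bytes) -> Tuple[int, Dict[str, object]]:
    """Return (status, error_body); the error body is empty on success."""
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset where the invalid sequence begins.
        return 400, {
            "error": "INVALID_ENCODING",
            "message": f"invalid byte 0x{body[exc.start]:02X} at offset {exc.start}",
            "byte_offset": exc.start,
        }
    return 200, {}


latin1_body = b'{"name": "Ren\xe9"}'   # 0xE9 is Latin-1, not valid UTF-8 here
print(gateway_check(latin1_body))
print(gateway_check('{"name": "Ren\u00e9"}'.encode("utf-8")))
```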


Enforcement Rules

| Violation | Action | Error Code |
| --- | --- | --- |
| Non-UTF-8 encoding detected | Reject (HTTP 400 or equivalent) | INVALID_ENCODING |
| Invalid UTF-8 byte sequence | Reject (HTTP 400 or equivalent) | INVALID_UTF8 |
| BOM present in JSON payload | Strip, process, and log warning | UNEXPECTED_BOM |
| Surrogate code point | Reject (HTTP 400 or equivalent) | SURROGATE_CODEPOINT |
| Non-character code point | Reject (HTTP 400 or equivalent) | NONCHARACTER |
| Forbidden control character (C0 other than TAB/LF/CR, or C1) | Reject (HTTP 400 or equivalent) | FORBIDDEN_CONTROL |
| NULL (U+0000) in JSON or XML | Reject (HTTP 400 or equivalent) | FORBIDDEN_CONTROL |
| Private-use code points without bilateral agreement | Reject or log warning | PRIVATE_USE |
| Missing encoding declaration | Log warning; assume UTF-8 | MISSING_CHARSET |

Enforcement MUST occur at the outermost integration boundary (API gateway, message broker ingress, file intake). Interior services MAY rely on gateway validation.
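The code-point-level rows of the table can be checked on already-decoded text along the lines of the sketch below. The function name is hypothetical, and the ranges are the standard Unicode definitions of surrogates, noncharacters, C0/C1 controls, and private-use areas; how each code maps to reject or warn remains a policy decision outside this sketch.

```python
import unicodedata
from typing import Set


def code_point_violations(text: str) -> Set[str]:
    """Return the set of enforcement error codes triggered by text."""
    codes: Set[str] = set()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            codes.add("SURROGATE_CODEPOINT")
        elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            # U+FDD0..U+FDEF plus the last two code points of every plane.
            codes.add("NONCHARACTER")
        elif (cp < 0x20 and ch not in "\t\n\r") or 0x80 <= cp <= 0x9F:
            # C0 controls other than TAB/LF/CR (including NULL), and C1 controls.
            codes.add("FORBIDDEN_CONTROL")
        elif unicodedata.category(ch) == "Co":
            codes.add("PRIVATE_USE")
    return codes


print(code_point_violations("clean\ttext\n"))
print(code_point_violations("nul\x00 and private \ue000"))
```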


Security Considerations

| Threat | Attack Vector | Mitigation |
| --- | --- | --- |
| Encoding injection | Overlong UTF-8 sequences bypass path/input filters (e.g., C0 AF is an overlong encoding of /) | R-2 eliminates overlong sequences; R-6 rejects non-UTF-8 before application logic |
| Normalization bypass | Visually identical strings with different byte representations bypass auth checks | R-3 mandates NFC; comparison MUST use normalized forms |
| Homoglyph spoofing | Characters from different scripts appear identical (Latin "a" U+0061 vs Cyrillic "а" U+0430) | Systems processing user-visible identifiers SHOULD apply confusable detection (UTS #39) |
| Null byte injection | Embedded U+0000 truncates strings in C-based systems, enabling filter bypass | Enforcement rules forbid NULL in text payloads |
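Two of these threats are easy to demonstrate. The sketch below shows a byte-level filter missing an overlong slash that a strict decoder rejects, and shows that NFC does not unify Latin and Cyrillic look-alikes, which is why confusable detection is a separate step. The sample path and helper name are invented for illustration.

```python
import unicodedata


def strict_decoder_rejects(data: bytes) -> bool:
    """True if strict UTF-8 decoding refuses the input."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


# Encoding injection: C0 AF is an overlong form of "/", so a filter that
# searches the raw bytes for "/" does not see it.
overlong_path = b"..\xc0\xaf..\xc0\xafetc"
naive_filter_passes = b"/" not in overlong_path
print(naive_filter_passes, strict_decoder_rejects(overlong_path))

# Homoglyph spoofing: the two letters stay distinct after NFC.
latin_a = "\u0061"
cyrillic_a = "\u0430"
same_after_nfc = unicodedata.normalize("NFC", latin_a) == unicodedata.normalize("NFC", cyrillic_a)
print(unicodedata.name(latin_a), "|", unicodedata.name(cyrillic_a), "|", same_after_nfc)
```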


Rationale

UTF-8 exclusively: UTF-8 encodes every Unicode scalar value (every code point except the surrogates excluded by R-2), is mandated by RFC 8259 (JSON) and BCP 18 (IETF protocols), is ASCII-compatible, self-synchronizing, and used by over 98% of web pages.

NFC over other forms: NFC is the most compact canonical form, preserves compatibility characters (unlike lossy NFKC/NFKD), and is recommended by both W3C and RFC 5198.

Per-format BOM rules: No single BOM policy fits all consumers - JSON forbids it (RFC 8259), while spreadsheet tools require it for reliable UTF-8 detection.


Version History

| Version | Date | Change |
| --- | --- | --- |
| 1.0.0 | 2026-03-28 | Initial definition |