Byte-order mark

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
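As a rough illustration of how a consumer might use the BOM, the following Python sketch inspects the first bytes of a stream and picks a matching decoder. The helper names `sniff_bom` and `decode_with_bom` are hypothetical, and the fallback to UTF-8 when no BOM is present is an assumption of this sketch rather than a rule of the standard. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Candidate marks, checked in this order because the UTF-32 little-endian
# mark (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using the encoding its BOM announces, dropping the BOM.

    The `default` used when there is no BOM is an assumption of this sketch.
    """
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

A stream that begins with the bytes FE FF would be read as big-endian UTF-16 here, and one that begins with FF FE as little-endian UTF-16. One known ambiguity remains: little-endian UTF-16 text whose first character after the BOM is U+0000 would be taken for UTF-32 by this ordering.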

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space": an invisible character that takes up no width and only prevents a line break at that position. Its deliberate use for this purpose has been deprecated since Unicode 3.2, which strongly prefers the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used solely with the semantics of a BOM.

UTF-8

While the Unicode standard allows a BOM in UTF-8,[2] it neither requires nor recommends one.[3] Byte order has no meaning in UTF-8,[4] so a BOM serves only to identify a text stream or file as UTF-8, or to show that it was converted from another format that has a BOM.

Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default. On Unix-like systems, however, which make heavy use of text files both as file formats and for inter-process communication, this practice interferes with the correct processing of important codes such as the shebang at the start of an interpreted script.[5] A BOM will also stop a batch file from executing on Windows, so batch files must be saved as ANSI rather than Unicode (although their native encoding is DOS code page 437). On any platform, a UTF-8 BOM will interfere with the interpretation of source code by compilers and tools that do not recognise it but could otherwise handle UTF-8. For example, in PHP with output buffering disabled, the BOM bytes count as output, so the page starts being sent to the browser before the script runs; the subtle effect is that the script can no longer set custom headers.

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
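The behaviour described above can be reproduced in a few lines of Python. This is a minimal sketch: `strip_utf8_bom` is a hypothetical helper, and the sample script contents are invented for illustration. The `utf-8-sig` codec is part of Python's standard library and discards a leading BOM on decoding, whereas the plain `utf-8` codec keeps it as a U+FEFF character.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A script saved with a UTF-8 BOM: the shebang is no longer the first two bytes.
raw = codecs.BOM_UTF8 + "#!/bin/sh\necho hello\n".encode("utf-8")

starts_with_shebang = raw.startswith(b"#!")                 # False
fixed_starts_with_shebang = strip_utf8_bom(raw).startswith(b"#!")  # True

# Decoding choices: plain UTF-8 keeps the BOM as U+FEFF, utf-8-sig drops it,
# and ISO-8859-1 shows the three stray characters mentioned above.
kept = raw.decode("utf-8")
dropped = raw.decode("utf-8-sig")
misread = raw.decode("iso-8859-1")[:3]
```

A tool that must tolerate files from editors that add the mark can decode with `utf-8-sig` or strip the three bytes before further processing; files without a BOM pass through both paths unchanged.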
