The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. Use of a BOM is optional; if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM may also indicate which of the several Unicode representations the text is encoded in.
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving Unicode text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
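As a minimal sketch of how a consumer might use the BOM during interchange, the Python below inspects the first bytes of a stream and picks a decoder accordingly. The helper names `sniff_bom` and `decode_with_bom` are hypothetical, and the UTF-8 fallback for BOM-less input is an assumption of this example, not something the Unicode standard mandates.

```python
import codecs

# Longer signatures are checked first: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16 one (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, dropping a leading BOM; `default` is an assumed fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

For instance, `decode_with_bom(b"\xfe\xff\x00h\x00i")` treats the input as big-endian UTF-16 and yields `"hi"`. Note that a UTF-16 little-endian text whose first character is U+0000 is indistinguishable from UTF-32 little-endian under this scheme; the sketch resolves that ambiguity in favour of UTF-32.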
If the BOM character appears in the middle of a data stream, Unicode specifies that it be interpreted as a "zero-width non-breaking space": an invisible character with no width whose only effect is to inhibit a line break between its neighbours. Its deliberate use for this purpose has been deprecated since Unicode 3.2, with the "Word Joiner" character, U+2060, strongly preferred instead. This allows U+FEFF to be used solely with the semantics of a BOM.
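The distinction between a leading and an embedded U+FEFF can be illustrated with Python's standard `utf-16` codec, which consumes only a BOM at the very start of the input. This is a small sketch of one decoder's behaviour, not a statement about every implementation; the sample string is made up for the example.

```python
import codecs
import unicodedata

# A little-endian UTF-16 stream: a leading BOM, then "a", U+FEFF, "b".
data = codecs.BOM_UTF16_LE + "a\ufeffb".encode("utf-16-le")

# The leading BOM selects the byte order and is dropped;
# the embedded U+FEFF is kept as an ordinary character.
text = data.decode("utf-16")

# Formal character names as recorded in the Unicode character database.
name_feff = unicodedata.name("\ufeff")
name_2060 = unicodedata.name("\u2060")

# Replacing the deprecated mid-stream use with the preferred Word Joiner.
modernised = text.replace("\ufeff", "\u2060")
```

Here `text` has three characters, the middle one being U+FEFF, while `modernised` carries U+2060 in its place.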
While the Unicode standard allows a BOM in UTF-8, it neither requires nor recommends one. Byte order has no meaning in UTF-8, so a BOM serves only to identify a text stream or file as UTF-8, or to show that it was converted from another format that has a BOM. Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default. On Unix-like systems, however, which make heavy use of text files for file formats as well as for inter-process communication, this practice interferes with correct processing of important codes such as the shebang at the start of an interpreted script. A BOM also makes a batch file non-executable on Windows, so batch files must be saved as ANSI, not Unicode (although their native encoding is DOS code page 437). On any platform, a UTF-8 BOM interferes with the interpretation of source code by compilers and tools that do not recognise it but could otherwise handle UTF-8. In PHP, for example, if output buffering is disabled, the BOM has the subtle effect of causing the page to start being sent to the browser, which prevents the script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
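The byte sequence and its ISO-8859-1 rendering can be checked directly, and the same sketch shows one way a program might tolerate a UTF-8 BOM on input. It relies on Python's standard `utf-8-sig` codec, which drops a leading BOM when decoding if one is present; the shell-script content in `raw` is an invented sample.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == b"\xef\xbb\xbf" == codecs.BOM_UTF8

# Misread as ISO-8859-1, each byte becomes one visible character.
mojibake = bom.decode("iso-8859-1")

# An invented script file saved with a UTF-8 BOM in front of the shebang.
raw = bom + b"#!/bin/sh\necho hello\n"

# The plain "utf-8" codec keeps U+FEFF as the first character ...
kept = raw.decode("utf-8")

# ... whereas "utf-8-sig" removes it, so the text starts with "#!".
stripped = raw.decode("utf-8-sig")
```

Because `kept` begins with U+FEFF rather than `#!`, a consumer that looks for the shebang in the first two bytes or characters will not find it, which is the interference described above.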