UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It is a fixed-length format that simply uses the code point as the 16-bit code unit, and it produces exactly the same result as UTF-16 for 63,488 code points in the range 0-0xFFFF (the 65,536 values in this range minus the 2,048 reserved surrogate values), including all characters that had been assigned a value in this range at that time.
UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in "The Unicode Standard" version 2.0 and higher, as well as in the IETF's RFC 2781.
Code points U+0000..U+D7FF and U+E000..U+FFFF
For these code points, both UTF-16 and UCS-2 use a single 16-bit code unit that is numerically equal to the code point. These code points lie in the Basic Multilingual Plane, or BMP (they cover the whole plane except the surrogate range U+D800..U+DFFF).
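A minimal sketch of this identity mapping (the function name `utf16_code_units_bmp` is a hypothetical helper, not part of any standard API): for a BMP code point outside the surrogate range, the encoded form is a single code unit equal to the code point itself.

```python
def utf16_code_units_bmp(cp: int) -> list[int]:
    """Encode a BMP code point (outside the surrogate range) as UTF-16."""
    if not (0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFF):
        raise ValueError("not a BMP code point outside the surrogate range")
    # The single 16-bit code unit is numerically equal to the code point.
    return [cp]

print(hex(utf16_code_units_bmp(0x20AC)[0]))  # U+20AC EURO SIGN -> 0x20ac
```

UCS-2 behaves identically here, which is why the two encodings agree on these code points.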
Code points U+10000..U+10FFFF
Code points larger than 0xFFFF are called supplementary code points; the planes that contain them are the Supplementary Planes.
It is not possible to encode these code points in UCS-2.
UTF-16 converts these into two 16-bit code units, called a surrogate pair, by the following scheme:
- 0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF.
- The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
- The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
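The steps above can be sketched directly in code (the function name `encode_surrogate_pair` is a hypothetical helper chosen for illustration):

```python
def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a supplementary code point as a UTF-16 surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000             # 20-bit number in the range 0..0xFFFFF
    high = 0xD800 + (v >> 10)    # top ten bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)   # low ten bits  -> low surrogate
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF encodes as the pair 0xD834 0xDD1E.
print(tuple(hex(u) for u in encode_surrogate_pair(0x1D11E)))
```

Decoding reverses the arithmetic: subtract 0xD800 and 0xDC00 from the two code units, recombine the two ten-bit halves, and add 0x10000 back.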