Search results
Results from the WOW.Com Content Network
UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as UCS-2 (for 2-byte Universal Character Set), [2] [3] once it became clear that more than 2 16 (65,536) code points were needed, [4] including most emoji and important CJK characters such as for personal and place names.
The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms. The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special ...
In Unicode and the UCS, a compatibility character is a character that is encoded solely to maintain round-trip convertibility with other, often older, standards. [1] As the Unicode Glossary says: A character that would not have been encoded except for compatibility and round-trip convertibility with other standards [2]
The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 jointly collaborate on the list of the characters in the Universal Coded Character Set.The Universal Coded Character Set, most commonly called the Universal Character Set (abbr. UCS, official designation: ISO/IEC 10646), is an international standard to map characters, discrete symbols used in natural language, mathematics, music, and other ...
The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent. UTF encodings include:
To encode characters outside of the BMP (unreachable in plain UCS-2), such as Emoji, UTF-16 uses surrogate pairs, which when decoded with UCS-2 would appear as two valid but unmapped code points. A single SMS GSM message using this encoding can have at most 70 characters (140 octets).
As another example, if some text was created originally using the MacRoman character set, the left double quotation mark " will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP-1252, where this code point is occupied by the letter Ò.
In order to include these missing characters the 16-bit UTF-16 (in GSM called UCS-2) encoding may be used at the price of reducing the length of a (non-segmented) message from 160 to 70 characters. The messages in Chinese, Korean or Japanese languages must be encoded using the UTF-16 character encoding. The same was also true for other ...