Version 1.2, 13 August 2021

Version history:

  • 1.0, 12 November 2015

  • 1.1, 28 May 2018

  • 1.2, 13 August 2021

Introduction

Some of the most basic technical concepts used when talking about the Unicode encoding of text are to be found in Chapter 2, General Structure, of the Unicode Standard [currently (August 2021) at version 13.0; abbreviated here as ‘TUS’; other parts of TUS may also be referenced]. We have noticed that some links in software chains make incorrect assumptions about the encoding of data, resulting in data loss and data corruption.

This document points to vital information; it is henceforth assumed that all service providers and software vendors take particular care to address these aspects of the Unicode Standard meticulously, while adhering to and implementing this international standard in its totality. The following paragraphs give only rough and imprecise descriptions of some important low-level concepts which must be considered when debugging a faulty data flow and/or conversion: software implementers, text-processing and hosting service providers, web and CSS designers, etc., must all refer to the full text of the Unicode Standard in order to reach a proper understanding.

Characters, code points

Unicode encodes characters, such as ‘c’ and ‘ç’: these may be represented by one or more code points. For instance, ‘ç’ can be encoded using the single code point U+00E7 (‘U+’ indicating that what follows immediately is a Unicode hexadecimal scalar value); but also as a combination of the character ‘c’ and the combining cedilla character ‘◌̧’ (U+0063.0327, being the combination of hexadecimal 0063 and hexadecimal 0327, with the dot indicating concatenation of the two values).

[TUS Ch. 2.1, 2.2, 2.4]
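The two representations of ‘ç’ described above can be verified with, for instance, Python's standard unicodedata module (a sketch; any Unicode-aware library offers equivalent normalization functions):

```python
import unicodedata

precomposed = "\u00E7"   # ç as the single code point U+00E7
decomposed = "c\u0327"   # U+0063 followed by combining cedilla U+0327

# The two strings differ code point by code point...
assert precomposed != decomposed
# ...but normalization converts one representation into the other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```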

Encoding forms, bits, bytes, encoding schemes, byte order mark

In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. They are all equally valid.

[TUS Ch. 2.5]
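As an illustration (shown here in Python, though the encoding forms themselves are language-independent), the same code point is expressed in code units of different widths by each encoding form:

```python
ch = "\u00E7"  # ç, code point U+00E7

assert ch.encode("utf-8") == b"\xc3\xa7"              # two 8-bit code units
assert ch.encode("utf-16-be") == b"\x00\xe7"          # one 16-bit code unit
assert ch.encode("utf-32-be") == b"\x00\x00\x00\xe7"  # one 32-bit code unit
```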

Interchange of textual data, particularly between computers of different architectural types, requires consideration of the exact ordering of the bits and bytes in numeric representation. In a numeric data type of more than one byte, for instance, the byte order can be ‘big-endian’ (most significant byte first in internal representation) or ‘little-endian’ (least significant byte first in internal representation). An initial byte order mark (BOM: U+FEFF) can explicitly differentiate big-endian from little-endian data in some of the Unicode encoding schemes.

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. In this scheme, there is no issue of big- versus little-endian byte order. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM, or where a BOM is used as a UTF-8 signature. Both UTF-16 and UTF-32 can be big-endian or little-endian.

[TUS Ch. 2.6, 2.13]
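The byte sequences involved can be inspected with, for example, Python's codecs module, which exposes the BOM of each encoding scheme as a constant (a sketch; the byte values themselves are fixed by the standard):

```python
import codecs

# The byte order mark U+FEFF serialises differently per encoding scheme:
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"  # the UTF-8 'signature'

# A decoder for the generic 'utf-16' scheme consumes the BOM and
# selects the byte order accordingly:
assert (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16") == "A"
assert (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16") == "A"
```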

When converting between different encoding schemes, extreme care must be taken in handling any initial byte order marks. The use of the initial byte sequence as a signature on UTF-8 byte sequences is not recommended.

[TUS Ch. 3.10]

Planes

The Unicode codespace consists of the single range of numeric values from 0 to 10FFFF₁₆. It has proven convenient to think of it as divided up into 17 planes of 64K (65,536) code points each.

  • The basic multilingual plane (BMP, or Plane 0) contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters.

  • In the supplementary multilingual plane (SMP, or Plane 1) characters are encoded for scripts or symbols which would not fit into the BMP or occur only infrequently. These include many historic scripts, some notational systems, and a few historic extensions of scripts otherwise encoded in the BMP.

  • The supplementary ideographic plane (SIP, or Plane 2) is intended for CJK characters [Chinese, Japanese, Korean; really CJKV, including Vietnamese] which did not fit into the BMP.

Planes 3 to 16 are either not yet used, or are intended for special purposes [Plane 14] or for private use [Planes 15 and 16].

[TUS Ch. 2.8, 2.9]
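Since each plane spans exactly 10000₁₆ code points, the plane of any code point can be computed directly (a sketch in Python; the helper function is ours, not part of any standard library):

```python
def plane(code_point: int) -> int:
    # Each plane spans 0x10000 code points, so the plane number is
    # simply the code point value shifted right by 16 bits.
    return code_point >> 16

assert plane(0x00E7) == 0     # ç: BMP
assert plane(0x1D510) == 1    # 𝔐 (MATHEMATICAL FRAKTUR CAPITAL M): SMP
assert plane(0x20000) == 2    # first CJK Extension B ideograph: SIP
assert plane(0x10FFFF) == 16  # last code point, in private-use Plane 16
```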

Note that Brill publications routinely contain characters from three planes: BMP, SMP, and SIP. This means that all software used in handling Brill data must be capable of processing characters from all planes of Unicode, not just of Plane 0!

Surrogate code points, surrogate pairs

The method used by UTF-16 to address the 1,048,576 supplementary code points that cannot be represented by a single 16-bit value (past U+FFFF, i.e., in higher planes than plane 0) is called surrogate pairs. The surrogates were added in Unicode Version 2.0 (1996). A surrogate pair is a representation for a single abstract character that consists of a sequence of two 16-bit codes, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.

  • High-surrogate code points are Unicode code points in the range U+D800 to U+DBFF.

  • High-surrogate code units are 16-bit code units in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.

  • Low-surrogate code points are Unicode code points in the range U+DC00 to U+DFFF.

  • Low-surrogate code units are 16-bit code units in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.

Surrogate code points cannot be conformantly interchanged using Unicode encoding forms.

[TUS Ch. 2.4, 2.5, 3.8, 5.4]
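The mapping from a supplementary code point to its surrogate pair can be sketched as follows (the function name is ours; the arithmetic is that of the UTF-16 encoding form):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    # Defined only for supplementary code points, U+10000..U+10FFFF.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000    # 20-bit offset into planes 1-16
    high = 0xD800 + (offset >> 10)   # leading (high) surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # trailing (low) surrogate: bottom 10 bits
    return high, low

# U+1D510 (MATHEMATICAL FRAKTUR CAPITAL M) is represented in UTF-16
# by the surrogate pair D835 DD10:
assert to_surrogate_pair(0x1D510) == (0xD835, 0xDD10)
```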

Pitfall

Until about the 2010s, many computer systems rarely, if ever, needed to process characters beyond the BMP, and as a result, some software may still (at the time of this writing: 2015) simply assume that no characters beyond the BMP are valid, and strip such characters or otherwise corrupt them. We have often seen SMP and SIP characters input and stored correctly in a CMS, only to have pairs of U+FFFD (REPLACEMENT CHARACTER) returned.

For instance, the character 𝔐 (U+1D510, decimal 120080; note the five-digit hexadecimal number: it is an SMP character) is stored in the source as two decimal numeric character references, 55349 and 56592. These two decimal values correspond to U+D835.DD10, which is in fact a surrogate pair, as can be deduced from their hexadecimal values. The use of surrogate pairs is limited to UTF-16; therefore, at some point the text was represented in UTF-16. But somehow a piece of software subsequently converts these two to U+FFFD.FFFD. There is no going back on this data corruption. Why did this conversion of the surrogate pair take place at all? Either the software cannot handle surrogate pairs (and therefore UTF-16) at all, or it processed each 16-bit code unit in isolation; in both cases each half of the pair is seen as an isolated surrogate code point, which makes the string ill-formed, and the two code units were therefore converted to U+FFFD.
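The corruption described above can be reproduced in miniature: a well-formed surrogate pair decodes to the intended character, but an unpaired surrogate is ill-formed and, under a ‘replace’ error policy, becomes U+FFFD (shown here in Python; the behaviour follows from the standard's well-formedness rules, not from the language):

```python
# U+1D510 serialised as UTF-16BE is the surrogate pair D835 DD10:
pair = b"\xd8\x35\xdd\x10"
assert pair.decode("utf-16-be") == "\U0001D510"

# Software that processes the two 16-bit units in isolation sees two
# unpaired surrogates; each one is ill-formed on its own and is turned
# into U+FFFD REPLACEMENT CHARACTER:
for unit in (b"\xd8\x35", b"\xdd\x10"):
    assert unit.decode("utf-16-be", errors="replace") == "\ufffd"
```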