Version 1.2.1, 25 March 2022

Version history:

  • 1.0, 12 November 2015

  • 1.1, 28 May 2018

  • 1.2, 13 August 2021
  • 1.2.1, 25 March 2022

Introduction

Some of the most basic technical concepts used when talking about Unicode encoding of text are to be found in chapter 2, General Structure, of the Unicode Standard [currently (August of 2021) at version 13.0; abbreviated here as ‘TUS’; other parts of TUS may also be referenced]. We have noticed that some links in software chains make incorrect assumptions about the encoding of data, resulting in data loss and data corruption. This document points to vital information; it assumes that all service providers and software vendors take particular care to address these aspects of the Unicode Standard meticulously, while adhering to and implementing this international standard in its totality. The following paragraphs give only rough and imprecise descriptions of some important low-level concepts which must be considered when debugging a faulty data flow and/or conversion. Software implementers, text processing and hosting service providers, web and CSS designers, etc., must all refer to the full text of the Unicode Standard in order to reach a proper understanding.

Characters, code points

Unicode encodes characters, such as ‘c’ and ‘ç’; each may be represented by one or more code points. For instance, ‘ç’ can be encoded as the single code point U+00E7 (‘U+’ indicating that what follows immediately is a hexadecimal Unicode scalar value), but also as the combination of the character ‘c’ and the combining cedilla ‘◌̧’ (U+0063.0327, i.e., hexadecimal 0063 followed by hexadecimal 0327, with the dot indicating concatenation of the two values).
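The difference between the two representations of ‘ç’ can be made visible with a few lines of Python. This is only an illustrative sketch: the helper name `code_points` is our own, and Unicode normalization (NFC/NFD), which is not discussed in this document, is used here merely as one way to compare the two sequences.

```python
import unicodedata

precomposed = "\u00E7"        # ç as the single code point U+00E7
decomposed = "\u0063\u0327"   # c followed by COMBINING CEDILLA (U+0063.0327)


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two strings look the same on screen but are different code point sequences.
same_raw = precomposed == decomposed

# After normalizing both to the same form they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(code_points(precomposed), code_points(decomposed))
print(same_raw, nfc_equal, nfd_equal)
```

Software that compares or searches text without taking both representations into account may treat identical-looking strings as different.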

[TUS Ch. 2.1, 2.2, 2.4]

Encoding forms, bits, bytes, encoding schemes, byte order mark (BOM)

In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. They are all equally valid.
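As a rough illustration of the three encoding forms, the following Python sketch counts how many code units one character needs in each form. The function name `code_units` and the sample characters are our own choices; the big-endian codecs are used only so that Python does not prepend a BOM.

```python
def code_units(ch):
    """Count the code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }


# c, ç, EURO SIGN, UGARITIC LETTER ALPA (a character outside Plane 0)
samples = ["c", "\u00E7", "\u20AC", "\U00010380"]

for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The same character always needs exactly one UTF-32 code unit, but between one and four UTF-8 code units and one or two UTF-16 code units.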

...

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. In this scheme, there is no issue of big- versus little-endian byte order. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM, or where a BOM is used as a UTF-8 signature. Both UTF-16 and UTF-32 can be big-endian or little-endian.
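The effect of byte order, and the byte sequences that a BOM takes in each encoding scheme, can be inspected as follows. This is a minimal sketch: `sniff_bom` is a hypothetical helper of our own, and it only looks at the first bytes of the data, so it cannot identify data that carries no BOM.

```python
import codecs

text = "\u00E7"  # ç

utf8 = text.encode("utf-8")          # same bytes on every platform
utf16_be = text.encode("utf-16-be")  # most significant byte first
utf16_le = text.encode("utf-16-le")  # least significant byte first

# UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = (
    ("UTF-32BE", codecs.BOM_UTF32_BE),
    ("UTF-32LE", codecs.BOM_UTF32_LE),
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-16BE", codecs.BOM_UTF16_BE),
    ("UTF-16LE", codecs.BOM_UTF16_LE),
)


def sniff_bom(data):
    """Return the scheme suggested by a leading BOM, or None if there is none."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(utf8.hex(), utf16_be.hex(), utf16_le.hex())
print(sniff_bom(codecs.BOM_UTF8 + b"<a/>"), sniff_bom(b"<a/>"))
```

Note that UTF-16LE data beginning with a BOM followed by U+0000 would be reported as UTF-32LE by this sketch; a BOM is a hint, not proof of the encoding.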

...

When converting between different encoding schemes, extreme care must be taken in handling any initial byte order marks. Note that many XML processing applications (such as the Highwire Online platforms) do not process XML with the BOM (Byte Order Mark) in place. So:

Remove all BOMs (Byte Order Marks) from XML files!

In the Oxygen Preferences → “Encoding” page you must switch “UTF-8 BOM handling” to “Don’t Write”. Make a small change in the XML and save the file again. Oxygen should remove the BOM from the beginning of the XML content.
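Where files cannot be re-saved by hand, a leading UTF-8 BOM can also be removed with a small script. The following is only a sketch under the assumption that the files are UTF-8 encoded; the function names are our own, and UTF-16 or UTF-32 files need a proper conversion rather than simple byte stripping.

```python
import codecs
from pathlib import Path


def remove_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path):
    """Rewrite the file at path without its UTF-8 BOM; return True if it changed."""
    path = Path(path)
    original = path.read_bytes()
    cleaned = remove_utf8_bom(original)
    if cleaned == original:
        return False  # no BOM found, file left untouched
    path.write_bytes(cleaned)
    return True
```

Only the first three bytes are examined; the rest of the file, including the encoding declaration in the XML prolog, is left exactly as it was.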

[See also TUS Ch. 3.10]

Planes

The Unicode codespace consists of the single range of numeric values from 0 to 10FFFF₁₆. It has proven convenient to think of it as divided up into 17 planes (Planes 0 to 16) of 64K code points each.

  • The basic multilingual plane (BMP, or Plane 0) contains the common-use characters for all the modern scripts of the world as well as many historical and rare characters.

  • In the supplementary multilingual plane (SMP, or Plane 1) characters are encoded for scripts or symbols which would not fit into the BMP or occur only infrequently. These include many historic scripts, some notational systems, and a few historic extensions of scripts otherwise encoded in the BMP.

  • The supplementary ideographic plane (SIP, or Plane 2) is intended for CJK characters [Chinese, Japanese, Korean; really CJKV, including Vietnamese] which did not fit into the BMP.

  • The tertiary ideographic plane (TIP, or Plane 3) received CJK Unified Ideographs Extension G (30000–3134F) with Unicode 13.0.0, in March 2020.

Planes 4 to 16 are either not yet used, or are intended for special purposes [Plane 14] or for private use [Planes 15 and 16].
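Because each plane spans 10000₁₆ code points, the plane of a character is simply its code point shifted right by 16 bits. The sketch below uses names of our own (`plane_of`, `planes_used`) to show how a text can be checked for the planes it draws on.

```python
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 3: "TIP"}


def plane_of(ch):
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16


def planes_used(text):
    """Return the set of plane numbers that occur in text."""
    return {plane_of(ch) for ch in text}


# One character each from Planes 0, 1, 2 and 3.
sample = "a\U00010380\U00020000\U00030000"

for plane in sorted(planes_used(sample)):
    print(plane, PLANE_NAMES.get(plane, "other"))
```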

...

Note that Brill publications routinely contain characters from four planes: BMP, SMP, SIP, and TIP. This means that all software used in handling Brill data must be capable of processing characters from all planes of Unicode, not just of Plane 0!

Surrogate code points, surrogate pairs

UTF-16 uses surrogate pairs to address the 1,048,576 supplementary code points that cannot be represented by a single 16-bit value (past U+FFFF, i.e., in higher planes than Plane 0). The surrogates were added in Unicode Version 2.0 (1996). A surrogate pair is a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.
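The arithmetic behind a surrogate pair can be sketched in Python as follows. The function names are our own; the constants D800₁₆ and DC00₁₆ are the first high-surrogate and first low-surrogate code units, and the authoritative definition is the one in TUS Ch. 3.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10380)  # UGARITIC LETTER ALPA
print(f"{high:04X} {low:04X}")
```

Software that treats each 16-bit code unit as a character will see two meaningless values here instead of one character, which is how supplementary characters get split or corrupted.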

...

[TUS Ch. 2.4, 2.5, 3.8, 5.4]

Pitfall

Until about the 2010s, many computer systems rarely, if ever, needed to process characters beyond the BMP, and as a result, some software might still (even now, in 2021) simply assume that no characters beyond the BMP are valid, and strip such characters or otherwise corrupt them. We have often seen SMP and SIP characters input and stored correctly in a CMS, only to have pairs of U+FFFD (REPLACEMENT CHARACTER) returned.
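When debugging such a data flow, it can help to compare the text that was sent into a system with the text that came back. The following is a minimal sketch of such a check, not a complete test: the function names are our own, and the `damaged` string merely simulates a system that replaces each supplementary character with two U+FFFD characters.

```python
REPLACEMENT = "\uFFFD"


def supplementary_chars(text):
    """Return the characters of text that lie beyond the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def check_round_trip(original, returned):
    """Return a list of problems found when comparing sent and returned text."""
    problems = []
    lost = [ch for ch in supplementary_chars(original) if ch not in returned]
    if lost:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in lost)
        problems.append(f"supplementary characters lost: {codes}")
    if REPLACEMENT in returned and REPLACEMENT not in original:
        problems.append("U+FFFD was introduced")
    return problems


original = "Ugaritic \U00010380 and CJK \U00020000"
# Simulated damage: every supplementary character becomes a pair of U+FFFD.
damaged = "".join(
    ch if ord(ch) <= 0xFFFF else REPLACEMENT * 2 for ch in original
)

for problem in check_round_trip(original, damaged):
    print(problem)
```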

...