Version 1.2.1, 25 March 2022

Version history:

  • 1.0, 12 November 2015
  • 1.1, 28 May 2018
  • 1.2, 13 August 2021
  • 1.2.1, 25 March 2022

Introduction

Some of the most basic technical concepts used in discussing the Unicode encoding of text are set out in chapter 2, General Structure, of the Unicode Standard [at version 13.0 as of August 2021; abbreviated here as ‘TUS’; other parts of TUS may also be referenced]. We have noticed that some links in software chains make incorrect assumptions about the encoding of data, resulting in data loss and data corruption. This document points to vital information. It is assumed that all service providers and software vendors take particular care to address these aspects of the Unicode Standard meticulously, while adhering to and implementing this international standard in its entirety. The following paragraphs give only rough and imprecise descriptions of some important low-level concepts that must be considered when debugging a faulty data flow or conversion. Software implementers, text processing and hosting service providers, web and CSS designers, and others must all refer to the full text of the Unicode Standard to reach a proper understanding.

Characters, code points

Unicode encodes characters, such as ‘c’ and ‘ç’; a character may be represented by one or more code points. For instance, ‘ç’ can be encoded as the single code point U+00E7 (‘U+’ indicating that what follows immediately is a Unicode code point value in hexadecimal), but also as the character ‘c’ followed by the combining cedilla character ‘◌̧’ (U+0063.0327, i.e., hexadecimal 0063 followed by hexadecimal 0327, with the dot indicating concatenation of the two values).
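
The two representations of ‘ç’ can be inspected with a few lines of Python. This is only an illustrative sketch: the helper name code_points is hypothetical, and the use of the standard library's unicodedata normalization (NFC/NFD) to compare the two forms is an assumption about how one might check equivalence, not a procedure prescribed by this document.

```python
import unicodedata

precomposed = "\u00E7"   # 'ç' as the single code point U+00E7
decomposed = "c\u0327"   # 'c' followed by U+0327 COMBINING CEDILLA


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two strings look the same when rendered but are different sequences.
same_sequence = precomposed == decomposed

# Normalizing both to one form (NFC composes, NFD decomposes) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(code_points(precomposed), code_points(decomposed), same_sequence)
```

Software that compares or searches text without taking both forms into account may treat the same character as two different strings.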

[TUS Ch. 2.1, 2.2, 2.4]

Encoding forms, bits, bytes, encoding schemes, byte order mark (BOM)

In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. They are all equally valid.
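
As a rough illustration of the three encoding forms, the sketch below lists the code units that Python's built-in codecs produce for one BMP character and one supplementary character. The helper name code_units and the choice of sample characters are assumptions made for this example; the big-endian codec variants are used only so that no byte order mark is emitted.

```python
def code_units(text, encoding, width):
    """Split the encoded bytes of text into code units of width bytes each."""
    data = text.encode(encoding)
    return [data[i:i + width].hex().upper() for i in range(0, len(data), width)]


samples = {
    "U+00E7": "\u00E7",       # 'ç', in the BMP
    "U+20000": "\U00020000",  # a CJK ideograph beyond the BMP
}

for label, ch in samples.items():
    print(label,
          "UTF-8:", code_units(ch, "utf-8", 1),
          "UTF-16:", code_units(ch, "utf-16-be", 2),
          "UTF-32:", code_units(ch, "utf-32-be", 4))
```

The same code point yields a different number of code units in each form, which is why software must know which encoding form it is reading.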

...

In the Oxygen Preferences → “Encoding” page you must switch “UTF-8 BOM handling” to “Don’t Write”. Make a small change in the XML and save the file again. Oxygen should remove the BOM from the beginning of the XML content.
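
Whether a saved file still begins with a UTF-8 BOM can be checked outside the editor. The following is a minimal sketch using Python's standard codecs module; the function names are hypothetical, and it only inspects the first three bytes (EF BB BF) of the data it is given.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM; other bytes are left untouched."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "<root>ç</root>".encode("utf-8")
print(has_utf8_bom(with_bom), strip_utf8_bom(with_bom))
```

When reading such data as text in Python, the "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if one is present.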

[See also TUS Ch. 3.10]

Planes

The Unicode codespace consists of the single range of numeric values from 0 to 10FFFF₁₆. It has proven convenient to think of it as divided up into 17 planes (numbered 0 to 16) of 64K code points each.
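
Because each plane spans 10000₁₆ code points, the plane of a code point can be obtained by discarding its lowest 16 bits. The sketch below assumes this arithmetic view of the codespace; the function name plane_of and the sample characters are chosen only for illustration.

```python
def plane_of(code_point):
    """Return the plane number of a code point in the range 0..10FFFF (hex)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    return code_point >> 16  # each plane holds 0x10000 (64K) code points


for ch in ("\u00E7", "\U00010330", "\U00020000", "\U00030000"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
```

A check of this kind can help establish whether a data set contains characters outside plane 0 before it is passed to software whose Unicode support is in doubt.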

...

Note that Brill publications routinely contain characters from four planes: BMP, SMP, SIP, and TIP. This means that all software used in handling Brill data must be capable of processing characters from all planes of Unicode, not just of Plane 0!

Surrogate code points, surrogate pairs

UTF-16 uses surrogate pairs to address the 1,048,576 supplementary code points that cannot be represented by a single 16-bit value (those past U+FFFF, i.e., in planes higher than plane 0). The surrogates were added in Unicode Version 2.0 (1996). A surrogate pair is a representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.
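
The arithmetic behind a surrogate pair can be sketched as follows. The function names are hypothetical; the constants reflect the UTF-16 definition, in which high surrogates occupy D800 to DBFF and low surrogates DC00 to DFFF (hexadecimal), each carrying 10 bits of the code point's offset from 10000₁₆.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")
```

The two code units only have meaning together: neither half is a character on its own, so they must never be separated, reordered, or processed individually.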

...

[TUS Ch. 2.4, 2.5, 3.8, 5.4]

Pitfall

Until about the 2010s, many computer systems rarely, if ever, needed to process characters beyond the BMP. As a result, some software may still (even now, in 2021) simply assume that no characters beyond the BMP are valid, and strip such characters or otherwise corrupt them. We have often seen SMP and SIP characters input and stored correctly in a CMS, only to have pairs of U+FFFD (REPLACEMENT CHARACTER) returned in their place.
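
One plausible way in which a single supplementary character turns into a pair of U+FFFD is a component that treats every 16-bit code unit as a character of its own and rejects surrogates. The sketch below simulates such a component; it is a hypothetical model for debugging purposes, not a description of any particular CMS, and both function names are invented for this example.

```python
def bmp_only_round_trip(text):
    """Simulate a hypothetical step that handles UTF-16 code units one at a time
    and replaces every surrogate code unit with U+FFFD."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\uFFFD" if 0xD800 <= u <= 0xDFFF else chr(u) for u in units)


def supplementary_characters(text):
    """List the characters in text that lie beyond the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


original = "a\U00020000b"
returned = bmp_only_round_trip(original)
print(supplementary_characters(original), [f"U+{ord(c):04X}" for c in returned])
```

Comparing the supplementary characters present before and after each step of a data flow is one way to locate the link in the chain where the corruption occurs.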

...