Basic Unicode encoding concepts (technical)

Version 1.2.1, 25 March 2022

Version history:

1.0, 12 November 2015
1.1, 28 May 2018
1.2, 13 August 2021
1.2.1, 25 March 2022

Introduction

Some of the most basic technical concepts used when talking about the Unicode encoding of text are found in chapter 2, General Structure, of the Unicode Standard [at version 13.0 as of August 2021; abbreviated here as ‘TUS’; other parts of TUS may also be referenced]. We have noticed that some links in software chains make incorrect assumptions about the encoding of data, resulting in data loss and data corruption. This document points to vital information. It is henceforth assumed that all service providers and software vendors take particular care to address these aspects of the Unicode Standard meticulously, while adhering to and implementing this international standard in its totality. The following paragraphs give only rough and imprecise descriptions of some important low-level concepts that must be considered when debugging a faulty data flow or conversion. Software implementers, text processing and hosting service providers, web and CSS designers, etc., must all refer to the full text of the Unicode Standard in order to reach a proper understanding.

...

For instance, a character 𝔐 (U+1D510, decimal 120080; note the 5-figure hexadecimal number: it is an SMP character) is stored in the source as &#55349;&#56592;. These two decimal values reflect U+D835.DD10. This is in fact a surrogate pair, which can be deduced from their hexadecimal values: the first lies in the high-surrogate range D800..DBFF and the second in the low-surrogate range DC00..DFFF. The use of surrogate pairs is limited to UTF-16. From the order of these two code points it may be concluded that the encoding scheme is big-endian. Therefore, at some point the text was stored as UTF-16BE. But somehow a piece of software subsequently converts these two to U+FFFD.FFFD. There is no going back on this data corruption. Why did this conversion of the surrogate pair take place at all? Either the software cannot handle surrogate pairs and therefore UTF-16, or the software assumed or was told that the data was little-endian, so that each half of the surrogate pair was recognised as an isolated surrogate code point, which makes the string ill-formed, and the two code units were therefore converted to U+FFFD.
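The arithmetic behind this example can be checked with a short Python sketch. The helper names `to_surrogates` and `from_surrogates` are hypothetical and introduced only for illustration. The final step uses the standard library function `html.unescape` to show one way in which isolated surrogate values can end up as U+FFFD; it is an illustration under that assumption, not a diagnosis of the software chain described above.

```python
import html


def to_surrogates(cp: int) -> tuple:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a well-formed surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D510 corresponds to the pair D835 DD10, i.e. decimal 55349 and 56592.
high, low = to_surrogates(0x1D510)
pair_hex = f"{high:04X} {low:04X}"

# The same character serialised under the two UTF-16 byte orders.
bytes_be = "\U0001D510".encode("utf-16-be")
bytes_le = "\U0001D510".encode("utf-16-le")

# A character reference to the full code point survives a round trip ...
intact = html.unescape("&#120080;")

# ... whereas references to each surrogate half are treated as isolated
# surrogates and each one is replaced by U+FFFD.
corrupted = html.unescape("&#55349;&#56592;")
```

Once both halves have been replaced, `corrupted` holds only two copies of U+FFFD, and `from_surrogates` can no longer be applied to recover U+1D510.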
