Basic Unicode encoding concepts (technical)

...

For instance, the character 𝔐 (U+1D510, decimal 120080; note the five-digit hexadecimal number: it is an SMP character) is stored in the source as &#55349;&#56592;. These two decimal values correspond to U+D835.DD10. This is in fact a surrogate pair, as their hexadecimal values show: D835 lies in the high-surrogate range D800 to DBFF, and DD10 in the low-surrogate range DC00 to DFFF. The use of surrogate pairs is limited to UTF-16. From the order of these two code points it may be concluded that the encoding scheme is big-endian. Therefore, at some point the text was stored as UTF-16BE.

Somehow, a piece of software subsequently converted these two values to U+FFFD.FFFD, that is, the replacement character twice. This data corruption cannot be undone. Why did this conversion of the surrogate pair take place at all? Either the software cannot handle surrogate pairs, and therefore UTF-16, or the software assumed or was told that the data was little-endian, so that each half of the surrogate pair was recognised as an isolated surrogate code point. That makes the string ill-formed, and the two code units were therefore each converted to U+FFFD.
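The arithmetic behind this example can be checked with a short Python sketch. The helper name `to_surrogate_pair` is hypothetical and introduced only for illustration. The last step uses Python's own `utf-16-be` codec with `errors="replace"` as a stand-in for the unknown software that produced the corruption; it shows only that an isolated surrogate decodes to U+FFFD, not how that particular software behaved.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into its UTF-16 surrogates.

    Hypothetical helper, written for this illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x1D510)
print(f"{high:04X} {low:04X}")   # D835 DD10
print(high, low)                 # 55349 56592

# Python's UTF-16BE encoder produces the same two code units.
encoded = "\U0001D510".encode("utf-16-be")
assert encoded == bytes([0xD8, 0x35, 0xDD, 0x10])

# Decoding each half in isolation is ill-formed; with errors="replace"
# the decoder substitutes U+FFFD, and the original character is lost.
lone_high = encoded[:2].decode("utf-16-be", errors="replace")
lone_low = encoded[2:].decode("utf-16-be", errors="replace")
print("\ufffd" in lone_high, "\ufffd" in lone_low)
```

Decoding the four bytes together with `encoded.decode("utf-16-be")` returns 𝔐 again, which is why the pair must stay intact through every processing step.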
