Version 2.1, 8 April 2015

Version history:

Version 1.0, 19 May 2014
Version 2.0, 6 June 2014
Version 2.1, 8 April 2015

Introduction

The encoding of Arabic and Persian text for publishing online and in print needs some attention: there have been instances of basic letters being encoded incorrectly, thereby making files unusable. Service providers unfamiliar with the Arabic and Persian languages and the script used to represent them often follow the copy provided visually, without taking the semantics of characters into account.

Rules for input of Arabic and Persian (Farsi) text

The following rules were determined after some research into current practices and recommendations concerning Arabic and Persian on the Web, and in consultation with the representative of the Sultanate of Oman (MARA) at the Unicode Technical Committee. It is imperative that all service providers involved in the input of Arabic and/or Persian text guarantee to Brill that the person(s) involved in the work know the Arabic and/or Persian language and the Perso-Arabic script, as well as the Unicode Standard insofar as it applies to Arabic and Persian, and Unicode Standard Annex #9 (UAX #9), Unicode Bidirectional Algorithm (http://www.unicode.org/reports/tr9/).

Furthermore:

If the text is in Arabic:

Key Arabic kāf (ك) as such, i.e., as Unicode hexadecimal 0643 [U+0643], not as Persian kef (ک, U+06A9).
Key Arabic yāʾ as such (ي, U+064A) and distinguish it from alif maqṣūra (ى, U+0649), which is a different character with its own code. This absolutely requires knowledge of the Arabic language.
Only in Qurʾānic texts (or shorter quotations) may the so-called ‘Farsi’ yeh (ی) with the code U+06CC be used in final position, i.e., before a space character, a punctuation character, a parenthesis or bracket, or a return character.
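
The Arabic rules above can be checked mechanically as a first pass. The following Python sketch is an illustration only, not a Brill tool: the names check_arabic and followed_by_letter are hypothetical, and the test for "final position" (the next character, after skipping combining marks, is not a letter) is an assumption about how the rule might be applied. It cannot tell yāʾ from alif maqṣūra where the language decides; that still requires a reader who knows Arabic.

```python
import unicodedata

PERSIAN_KEF = "\u06A9"  # Persian kef, not allowed in Arabic text
FARSI_YEH = "\u06CC"    # 'Farsi' yeh, allowed only in final position in Qur'anic text


def followed_by_letter(text, i):
    """True if the character after position i, skipping combining marks, is a letter."""
    j = i + 1
    while j < len(text) and unicodedata.category(text[j]) == "Mn":
        j += 1
    return j < len(text) and unicodedata.category(text[j]).startswith("L")


def check_arabic(text, quranic=False):
    """Return (index, message) pairs for characters that break the Arabic rules."""
    problems = []
    for i, ch in enumerate(text):
        if ch == PERSIAN_KEF:
            problems.append((i, "Persian kef U+06A9; use Arabic kaf U+0643"))
        elif ch == FARSI_YEH:
            if not quranic:
                problems.append((i, "Farsi yeh U+06CC outside Qur'anic text"))
            elif followed_by_letter(text, i):
                problems.append((i, "Farsi yeh U+06CC in non-final position"))
    return problems


if __name__ == "__main__":
    # A word keyed with Persian kef instead of Arabic kaf is reported.
    for index, message in check_arabic("\u06A9\u062A\u0627\u0628"):
        print(index, message)
```

A report from such a check is a list of positions to review, not a list of automatic corrections.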

If the text is in Persian:

Key Persian kef as such (ک, U+06A9), not as Arabic kāf (ك, U+0643).
Key Persian yeh and alef-e maqṣūr always as ی (U+06CC), never as Arabic yāʾ (ي, U+064A), and never as alif maqṣūra (ى, U+0649).
Use ZWNJ (ZERO WIDTH NON-JOINER, U+200C) inside Persian words to break character connections when Persian orthography dictates this. Although this device is commonly known as a “half space” (nīm fāṣeleh, نیم فاصله), it has no visible width. It is also known as a “virtual space”, fāṣeleh-ye majāzī, فاصله‌ی مجازی, which is the recognized computer term; and as a “zero space”, fāṣeleh-ye ṣefr, فاصله‌ی صفر.

When sending instructions to any service provider concerning Arabic or Persian text:

Always indicate whether the source text is in Arabic or in Persian. In mixed-language texts, each language string, of whatever length, should be marked with an appropriate attribute.
Always include the above rules in your instructions.
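
The rules do not name the attribute to use for marking language strings. One common choice, assumed here purely for illustration, is a lang attribute carrying the language tag "ar" or "fa" on an inline element. The helper mark_language and the choice of a span element are hypothetical.

```python
from html import escape


def mark_language(text, lang):
    """Wrap a language string in a span with a lang attribute (assumed markup)."""
    return f'<span lang="{lang}">{escape(text)}</span>'


if __name__ == "__main__":
    # A Persian word, keyed with Persian kef, inside an otherwise Arabic text.
    print(mark_language("\u06A9\u062A\u0627\u0628", "fa"))
```

With the language marked explicitly, a service provider can tell which of the two rule sets above applies to each string.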

Font to use for keying in Arabic and Persian (Farsi) text

Although both Windows and OS X provide fonts with Arabic characters, they are not completely reliable for input of Arabic and Persian text, because the resulting encoding is not entirely unambiguous. A notorious font hack occurs with the theonym ‘Allah’: when a user types alif-lām-lām-hāʾ, most fonts immediately substitute a glyph containing not only those letters, but also šadda (◌ّ, U+0651) and dagger alif (◌ٰ, U+0670), thus – الله. This happens even though the user may want, or the original source may demand, a different spelling, such as one without šadda and dagger alif, or (in standard contemporary Qurʾān orthography) one with šadda and fatḥa (◌َ, U+064E).
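
Because the font may display all of these spellings identically, the encoding is best verified by listing code points rather than by looking at the rendered text. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the order in which the two marks follow the second lām is simply one possible keying order, not something the rules above prescribe.

```python
import unicodedata

# Three encodings of the same word, differing only in the combining marks.
PLAIN = "\u0627\u0644\u0644\u0647"                          # alif, lam, lam, ha
WITH_DAGGER_ALIF = "\u0627\u0644\u0644\u0651\u0670\u0647"   # + shadda, dagger alif
WITH_FATHA = "\u0627\u0644\u0644\u0651\u064E\u0647"         # + shadda, fatha


def codepoints(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_marks(text):
    """Remove combining marks (category Mn), leaving only the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for spelling in (PLAIN, WITH_DAGGER_ALIF, WITH_FATHA):
        print(" ".join(codepoints(spelling)))
        print(", ".join(unicodedata.name(ch) for ch in spelling))
```

All three strings reduce to the same four letters once the marks are stripped, which is why only a code point listing shows which spelling was actually keyed.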

The font that allows all orthographies to be input and encoded correctly is Scheherazade, a free download. Please note that although Scheherazade has the great advantage of allowing orthographically correct input and encoding, it is unfortunately not really suitable as an output font, because it is typographically simplistic, and deliberately so. Currently only the DecoType Tasmeem fonts (Emiri, Naskh, Nastaliq) provide typographically acceptable rendering of Arabic; this demands the use of InDesign with the special Tasmeem plugin. On-screen rendering of Arabic text, for instance on the Web, will remain problematic for some time. It is quite possible that .svg font technology (‘SVG’ stands for ‘Scalable Vector Graphics’) in tandem with a well-developed Arabic shaping engine will at last make sophisticated Arabic typography on the Web possible.

It is not only font technology which can cause wrong or ambiguous encoding. The Unicode Standard itself has certain lacunae which make it currently impossible to encode the Qurʾān unambiguously. For instance, there is no support for a hamza that occurs neither in the rasm nor above an individual consonant, but above the rasm between two consonants. There is also a need for so-called ‘archigraphemes’ of consonants to be encoded, together with the various dots used to disambiguate archigraphemes, these dots being encoded as combining diacritical characters.

Until all these standardization (Unicode) and font technology issues have been resolved, the encoding of Arabic (and Persian) text will remain tricky and demand close attention from all involved.

Bidirectional aspects

To be added [PR]