Version 2.1, 8 April 2015 

Version history:

Introduction

The encoding of Arabic and Persian text for publishing online and in print needs some attention: there have been instances of basic letters being encoded incorrectly, thereby making files unusable. Service providers unfamiliar with the Arabic and Persian languages and the script used to represent them often follow the copy provided visually, without taking the semantics of characters into account.
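One way in which visually driven keying can show up in a file is the use of Arabic presentation-form characters (one code point per positional glyph shape) instead of the basic letters. The sketch below flags such characters. It is an illustration only: treating these two Unicode blocks as a sign of visual encoding is an assumption made here, not a rule stated in these guidelines, and the function name `find_presentation_forms` is hypothetical.

```python
import unicodedata

# Assumption: characters in the Arabic Presentation Forms-A and -B blocks are
# treated as "visual" encodings. The second range stops at U+FEFC so that
# U+FEFF (the byte order mark) is not reported.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def find_presentation_forms(text):
    """Return (index, character, name) for each presentation-form character."""
    hits = []
    for index, ch in enumerate(text):
        code_point = ord(ch)
        if any(low <= code_point <= high for low, high in PRESENTATION_RANGES):
            hits.append((index, ch, unicodedata.name(ch, "UNNAMED")))
    return hits


# The word "kitab" keyed with basic letters, and the same word copied
# shape by shape using positional presentation forms.
semantic = "\u0643\u062A\u0627\u0628"
visual = "\uFEDB\uFE98\uFE8E\uFE8F"

for index, ch, name in find_presentation_forms(visual):
    print(index, f"U+{ord(ch):04X}", name)
```

Both strings can look the same when rendered, but only the first can be searched, sorted and reshaped reliably, because the letter shapes are left to the font rather than fixed in the encoding.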

Rules for input of Arabic and Persian (Farsi) text

The following rules were determined after research into current practices and recommendations concerning Arabic and Persian on the Web, and in consultation with the representative of the Sultanate of Oman (MARA) at the Unicode Technical Committee. It is imperative that all service providers involved in the input of Arabic and/or Persian text guarantee to Brill that the person(s) involved in the work know the Arabic and/or Persian language and the Perso-Arabic script, as well as the Unicode Standard insofar as it applies to Arabic and Persian, and Unicode Standard Annex #9 (UAX #9), Unicode Bidirectional Algorithm (http://www.unicode.org/reports/tr9/).

Furthermore:

If the text is in Arabic:

If the text is in Persian:

When sending instructions to any service provider concerning Arabic or Persian text:

Font to use for keying in Arabic and Persian (Farsi) text

Although both Windows and OS X provide fonts with Arabic characters, they are not completely reliable for input of Arabic and Persian text, because the resulting encoding can be ambiguous. A notorious font hack occurs with the theonym ‘Allah’: when a user types alif-lām-lām-hāʾ, most fonts immediately substitute a glyph containing not only those letters, but also šadda (◌ّ, U+0651) and dagger alif (◌ٰ, U+0670), thus: الله. This happens even though the user may want, or the original source may demand, a different spelling, such as one without šadda and dagger alif, or (standard contemporary Qurʾān orthography) one with šadda and fatḥa (◌َ, U+064E).
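Because a font may render several of these spellings identically, the only dependable way to see what a file actually contains is to inspect the code points. The following minimal sketch lists the characters of the three spellings mentioned above. The order of the combining marks within each sequence and the helper name `describe` are assumptions made for illustration.

```python
import unicodedata

# Three encodings of the same word. The sequences are written with escapes so
# that no font substitution can hide what is stored.
SPELLINGS = {
    "letters only": "\u0627\u0644\u0644\u0647",
    "with shadda and dagger alif": "\u0627\u0644\u0644\u0651\u0670\u0647",
    "with shadda and fatha": "\u0627\u0644\u0644\u0651\u064E\u0647",
}


def describe(text):
    """Return one 'U+XXXX NAME' string per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in SPELLINGS.items():
    print(label)
    for line in describe(text):
        print("   ", line)
```

Comparing such listings against the source orthography shows whether the diacritics were actually keyed or merely supplied by the font at display time.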

The font that allows all orthographies to be input and encoded correctly is Scheherazade, a free download. Please note that although Scheherazade has the great advantage of allowing orthographically correct input and encoding, it is unfortunately not really suitable as an output font, because it is typographically simplistic, and deliberately so. Currently only the DecoType Tasmeem fonts (Emiri, Naskh, Nastaliq) provide typographically acceptable rendering of Arabic; this demands the use of InDesign with the special Tasmeem plugin. On-screen rendering of Arabic text, for instance on the Web, will remain problematic for some time. It is quite possible that SVG font technology (‘SVG’ stands for ‘Scalable Vector Graphics’) in tandem with a well-developed Arabic shaping engine will at last make sophisticated Arabic typography on the Web possible.

It is not only font technology which can cause wrong or ambiguous encoding. The Unicode Standard itself has certain lacunae which make it currently impossible to encode the Qurʾān unambiguously. For instance, there is no support for hamza occurring neither in the rasm nor above an individual consonant, but above the rasm between two consonants. There is also a need for so-called ‘archigraphemes’ of consonants to be encoded, together with the various dots used to disambiguate archigraphemes, these dots being encoded as combining diacritical characters.
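The point about dots can be illustrated with the character data shipped with Python. In the sketch below, alif with hamza above has a canonical decomposition into a base letter plus a combining mark, whereas the dotted consonants are atomic: the dots are not available as separate combining characters. This is an illustration of the current state of the character database only, not of any proposed encoding, and the helper name `decomposition_report` is hypothetical.

```python
import unicodedata


def decomposition_report(chars):
    """Map each character name to the names of its decomposition parts."""
    report = {}
    for ch in chars:
        parts = unicodedata.decomposition(ch).split()
        # Skip compatibility tags such as '<initial>'; keep only code points.
        code_points = [p for p in parts if not p.startswith("<")]
        report[unicodedata.name(ch)] = [
            unicodedata.name(chr(int(cp, 16))) for cp in code_points
        ]
    return report


# Alif with hamza above, followed by the dotted consonants ba, ta and tha.
sample = "\u0623\u0628\u062A\u062B"

for name, parts in decomposition_report(sample).items():
    print(name, "->", parts if parts else "no decomposition")
```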

Until all these standardization (Unicode) and font technology issues have been resolved, the encoding of Arabic (and Persian) text will remain tricky and demand close attention from all involved.

Bidirectional aspects

To be added [PR]

See also Pagination of critical editions with non-Western scripts