Introduction

Adequate representation of information in non-Latin languages has been one of Brill's strengths in the “paper period”. The use of online platforms to access our publications continues to grow. Presenting publications onscreen as PDF ensures the quality that our customers know from print. Besides PDF, most publications are also available in full-text HTML format. Correct display of non-Latin scripts in HTML can only be ensured by “pushing” the necessary font(s) together with the web page. This starts with proper coding of language and script information in the source XML, as explained below.

The xml:lang-attribute

Both the JATS and BITS standards support the attribute @xml:lang. Some important passages from the description of this attribute in the JATS Tag Library:

  1. The language of the intellectual content of the element for which this is an attribute.
  2. The value of this attribute must conform to IETF RFC 5646 (https://tools.ietf.org/html/rfc5646). For most languages, a primary-language subtag such as “fr” (French), “en” (English), “de” (German), or “zh” (Chinese) is sufficient.
  3. In some languages, script codes are also critically important; for example, in Japanese, there is the need to express whether a name is in Kanji as opposed to in Kana (Hiragana or Katakana) to determine sort keys. Best practice is to use the full language-code-plus-script-code as the value for @xml:lang. In our use of both language and script tagging as values for @xml:lang, we are following the IETF (Internet Engineering Task Force) best practice guideline: Network Working Group Request for Comments: 5646 [Tags for Identifying Languages, A. Phillips and M. Davis, Editors, September 2009]. That document defines a language tag as composed of (in part):
    1. A language code (typically using the shortest ISO 639 code)
    2. Potentially followed by a hyphen and then a script code (using the ISO 15924 code)
  4. Thus, for example, the following are among the expected values of @xml:lang for Japanese, incorporating both a language (“ja”) and a script type:
    1. xml:lang="ja-Hira" (Japanese written in Hiragana)
    2. xml:lang="ja-Hrkt" (Japanese written in Hiragana + Katakana)
    3. xml:lang="ja-Jpan" (Japanese written in Han + Hiragana + Katakana)
    4. xml:lang="ja-Hani" (Japanese written in Kanji (Hanzi, Hanja, Han))
    5. xml:lang="ja-Kana" (Japanese written in Katakana)
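
The language-plus-script structure described above can be checked mechanically. The following is a minimal sketch in Python, assuming only the simple `language` or `language-script` form shown in the examples. The function name `split_lang_tag` and the pattern are illustrative; they do not implement the full RFC 5646 grammar, so tags with region, variant, or extension subtags are rejected.

```python
import re

# Simplified pattern (an assumption for this sketch): a 2- or 3-letter primary
# language subtag, optionally followed by a hyphen and a 4-letter ISO 15924
# script subtag. This is NOT a complete RFC 5646 parser.
TAG_PATTERN = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?$")


def split_lang_tag(tag):
    """Return (language, script) for a simple tag; script is None if absent."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script] tag: {tag!r}")
    language, script = match.groups()
    # Conventional casing: language in lowercase, script in title case.
    return (language.lower(), script.title() if script else None)


if __name__ == "__main__":
    for value in ("ja-Hira", "ja-Jpan", "fr"):
        print(value, "->", split_lang_tag(value))
```

Normalising the case is a convenience only; RFC 5646 tags are case-insensitive, and the casing used here follows the examples in the list above.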

The @xml:lang attribute can be assigned to most elements of the JATS and BITS standards; the JATS and BITS documentation lists them. If the content of an element is entirely in one language, the attribute can be assigned to that element. For example:

  1. <abstract xml:lang="zh-Hans"> for a Chinese-language abstract in simplified Chinese script.
  2. <article-title xml:lang="cop-Copt"> for an article title in Coptic.
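
As a quick check of element-level tagging, the sketch below lists every element in a document that carries @xml:lang. It uses Python's standard `xml.etree.ElementTree`. The sample fragment is a hypothetical, heavily reduced JATS-like structure with placeholder text, not a valid JATS article, and the function name `tagged_elements` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified fragment; the text content is placeholder only.
SAMPLE = """<article xml:lang="en">
  <front>
    <article-meta>
      <title-group>
        <article-title xml:lang="cop-Copt">&#x2C80;&#x2C81;</article-title>
      </title-group>
      <abstract xml:lang="zh-Hans"><p>&#x6458;&#x8981;</p></abstract>
    </article-meta>
  </front>
</article>"""


def tagged_elements(xml_text):
    """Return (element name, xml:lang value) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get(XML_LANG)) for el in root.iter() if el.get(XML_LANG)]


if __name__ == "__main__":
    for name, lang in tagged_elements(SAMPLE):
        print(f"<{name}> is tagged {lang}")
```

Elements without their own attribute, such as the <p> inside the abstract, are not listed by this sketch, although in XML they inherit the language of their nearest tagged ancestor.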

The <styled-content>-element

Brill publications often contain a few words in a non-Latin script within a sentence in English or another language. In this case, the language and script information should be assigned only to these few words and not to the complete element. For this situation, the element <styled-content> should be used (see https://jats.nlm.nih.gov/publishing/tag-library/1.2/element/styled-content.html). The @xml:lang attribute can be assigned to the <styled-content> element, for example:

  1. … instead of <styled-content xml:lang="mid-Mand">&#x0848;&#x0840;&#x0841;&#x0845;&#x0855;&#x0840;</styled-content> &#x1E6D;abuta … For a word in Mandaic in a sentence. Note that the Mandaic characters are coded using their Unicode values.
  2. … themselves “<styled-content xml:lang="txg-Tang">&#x18736;&#x17D32;&#x170A7;</styled-content>” [tha dźjwij lhji.j] — the “State of Great Xia”… For a word in Tangut. Here also, the characters are coded using their Unicode values to avoid squares if a font to display the values in the xml-editor is not available.
  3. … sage (<styled-content xml:lang="he-Hebr">&#x05DC;&#x05DE;&#x05E9;&#x05DB;&#x05D9;&#x05DC;</styled-content>) to instruct … For a word in Hebrew.
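
The two ideas in these examples, tagging only the inline run and coding its characters as Unicode values, can be sketched in Python as follows. The wrapping <p> element and the helper names `inline_runs` and `to_char_refs` are assumptions made for illustration; the Hebrew word is taken from the third example above.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The <p> wrapper is hypothetical; the inline run comes from the Hebrew example.
SAMPLE = (
    '<p>sage (<styled-content xml:lang="he-Hebr">'
    "&#x05DC;&#x05DE;&#x05E9;&#x05DB;&#x05D9;&#x05DC;"
    "</styled-content>) to instruct</p>"
)


def inline_runs(xml_text):
    """Return (xml:lang value, decoded text) for each <styled-content> run."""
    root = ET.fromstring(xml_text)
    return [
        (el.get(XML_LANG), "".join(el.itertext()))
        for el in root.iter("styled-content")
    ]


def to_char_refs(text):
    """Encode every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):04X};" for ch in text)


if __name__ == "__main__":
    for lang, text in inline_runs(SAMPLE):
        print(lang, to_char_refs(text))
```

The parser decodes the numeric references into real characters, so `to_char_refs` is only needed when the text has to be written back in the reference form used in the examples.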

Overview of supported languages / scripts / web fonts

The following table gives an overview of the non-Latin scripts known to be used in Brill publications. For most of these, web fonts are available, including instructions for sizing relative to the Brill typeface. Please note that, because Brill currently uses language-script tags exclusively to trigger web fonts, language tags may be artificial. For example, Aramaic text written in the ‘Hebrew’ square script is tagged as ‘he-Hebr’ to simplify tagging, even though the language tag ‘he’ does not apply to Aramaic, which was and is a language distinct from Hebrew.

| No. | Script name | Language code(s) | Script code |
| --- | --- | --- | --- |
| 001 | Latin | (many) | Latn |
| 002 | Greek | el | Grek |
| 003 | Cyrillic | (many) | Cyrl |
| 004 | Old Slavic | cu | Cyrs |
| 005 | Hebrew | he | Hebr |
| 006 | Paleo-Hebrew | hbo | Phnx |
| 007 | Aramaic (biblical) | he (there is no ‘general’ Aramaic language tag) | Hebr |
| 008 | Aramaic (imperial) | arc | Armi |
| 009 | Syriac Estrangelo | syr | Syre |
| 010 | Syriac Serto | syr | Syrj |
| 011 | Arabic | ar | Arab |
| 012 | Armenian | hy | Armn |
| 013 | Coptic | cop | Copt |
| 014 | Gəʿəz (Ethiopian) | gez | Ethi |
| 015 | Georgian (Mkhedruli, Mtavruli, Khutsuri) | ka | Geor |
| 016 | Gothic | got | Goth |
| 017 | Samaritan | smp | Samr |
| 018 | Syriac, East | syr | Syrn |
| 019 | Glagolitic | (for future use; no language tag yet) | Glag |
| 020 | Old Turkic | otk | Orkh |
| 021 | Devanagari | sa | Deva |
| 022 | Tibetan | bo | Tibt |
| 023 | Chinese (simplified) | (for future use; no web font in use yet) | Hans |
| 024 | Chinese (traditional) | (for future use; no web font in use yet) | Hant |
| 025 | Japanese | ja (for future use; no web font in use yet) | Jpan |
| 026 | Lisu | lis | Lisu |
| 027 | Cypriot syllabary | grc | Cprt |
| 028 | Georgian (Asomtavruli) | ka | Geok |
| 029 | Logic and mathematics | und | Zmth |
| 030 | Tangut | txg | Tang |
| 031 | Mandaic | mid | Mand |
| 032 | Sindhi | sd | Arab |
| 033 | Egyptian hieroglyphs | egy | Egyp |
| 034 | Manichaean | xmn | Mani |
| 035 | Avestan | ae | Avst |
| 036 | Epichoric Greek (archaic local Greek scripts) | grc | Grek-epi |
| 037 | Linear B (Mycenaean Greek) | gmy | Linb |
| 038 | Uyghur in Arabic script | ug | Arab |
| 039 | Persian | fa | Arab |
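
A tagging workflow could compare each @xml:lang value against this overview before delivery. The sketch below is one possible check, written in Python. It assumes that the language code and script code of a row are joined with a hyphen, as in the ‘he-Hebr’ example. The set contains only an excerpt of the table, and the names `SUPPORTED_EXCERPT` and `is_listed` are hypothetical.

```python
# Excerpt of (language code, script code) pairs copied from the table above.
# It is deliberately incomplete; extend it from the table as needed.
SUPPORTED_EXCERPT = {
    ("el", "Grek"),
    ("he", "Hebr"),
    ("arc", "Armi"),
    ("syr", "Syre"),
    ("syr", "Syrj"),
    ("syr", "Syrn"),
    ("ar", "Arab"),
    ("fa", "Arab"),
    ("cop", "Copt"),
    ("txg", "Tang"),
    ("mid", "Mand"),
}


def is_listed(tag):
    """Return True if a language-script tag matches a pair in the excerpt."""
    language, _, script = tag.partition("-")
    return (language, script) in SUPPORTED_EXCERPT


if __name__ == "__main__":
    for value in ("syr-Syre", "he-Hebr", "he-Arab", "he"):
        print(value, "listed" if is_listed(value) else "not listed")
```

Because the check compares complete pairs, a tag without a script code is reported as not listed, which matches the recommendation to use the full language-code-plus-script-code form.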

brill.com

At brill.com, the xml:lang attribute and the <styled-content> element are converted to the corresponding HTML encoding. Furthermore, the CSS includes the declarations of the web fonts used. Web font packages are available for most of the scripts listed in the table above and thus ensure proper display of the characters in HTML, comparable to the PDF format.
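
The actual conversion used at brill.com is not described in this document. Purely as an illustration of the principle, the Python sketch below renders a <styled-content> run as an HTML <span> with a `lang` attribute, which a stylesheet could then use to select a web font. The function name `to_html`, the choice of <span>, and the pass-through of other element names are all assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_html(el):
    """Render an element recursively; <styled-content> becomes <span>.

    Other element names are passed through unchanged, which is a
    simplification made for this sketch only.
    """
    tag = "span" if el.tag == "styled-content" else el.tag
    lang = el.get(XML_LANG)
    attrs = f' lang="{escape(lang)}"' if lang else ""
    inner = escape(el.text or "")
    for child in el:
        # Text following a child element is stored in the child's tail.
        inner += to_html(child) + escape(child.tail or "")
    return f"<{tag}{attrs}>{inner}</{tag}>"


if __name__ == "__main__":
    source = (
        '<p>sage (<styled-content xml:lang="he-Hebr">'
        "&#x05DC;&#x05DE;&#x05E9;&#x05DB;&#x05D9;&#x05DC;"
        "</styled-content>) to instruct</p>"
    )
    print(to_html(ET.fromstring(source)))
```

With output of this shape, a CSS selector such as `[lang="he-Hebr"]` could attach the matching web font to the tagged run. The selectors and font names actually used on brill.com are not given here.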