Download Free Audio of Unicode As computers speak in the language of 1... - Woord


Unicode

Because computers operate on 1s and 0s, they store each character of a script as a unique numeric value. This process is known as character encoding and is the bedrock of communication in information technology. Character encoding gives data meaning, allowing the computer to differentiate and interpret letters, numbers and symbols. For text to be exchanged accurately on a global scale, regardless of the software or its origin, the characters of a script must be encoded using a consistent and universally applied method. The Unicode Standard provides this standardisation. For a script to gain a published table of encoded characters, in other words to gain Unicode support, the Unicode Consortium must first approve a formal proposal (Unicode). With a character encoding table, type designers can map each glyph within a font to its respective character code, which reinforces the encoding standard and makes the encoded characters usable.

Unicode has become essential to achieving typographic stability when using the Khmer script on-screen. It has been the recognised encoding standard for Khmer in Cambodia since 2001. Before this, an estimated 30 competing formats were used, “which simply painted Khmer letters over existing Roman alphabet characters” (Phnom Penh Post, 2008). Multiple encoding formats are a breeding ground for inconsistencies, notably in glyph-to-character mapping. Without Unicode standardisation, Khmer text would be typographically unstable. For example, switching from one font to another with different character encoding properties would change the text itself, not just the font as the user intended (Open Forum of Cambodia, 2005).

Text Layout

Text layout is a crucial aspect of typography, assisting in the delivery and readability of a message. It is often synonymous with design considerations such as composition, rhythm, proportion and hierarchy.
However, it is also fundamental to software development and text processing, as it is responsible for how the integral structure of a writing system is displayed on-screen. This aspect of text layout is particularly crucial when handling complex scripts, such as Khmer, which have contextual and non-linear typographic requirements. It is less apparent, and less of a priority, when dealing with simple scripts such as Latin, one of the easiest to display (Unicode, ICU Documentation, n.d). Latin characters can be processed from left to right, in the same order in which they are stored, using the Unicode character encoding table (Unicode, n.d). The script behaves in a what-you-see-is-what-you-get fashion, and contextual analysis is unnecessary. Its rendering behaviour and layout requirements are therefore straightforward.

Complex scripts, such as Khmer, require measures beyond the straightforward transformation of character code points into glyphs. These measures are needed for around half of the world’s writing systems and are handled collaboratively by layout engines, shaping engines, multiple algorithms and fonts (Hudson, 2016; Windows, 2015). The following is a simplified diagram using the OpenType model, presented by John Hudson of Tiro Typeworks, highlighting that collaboration.

The first stage that text passes through is the text layout engine, which is responsible for interpreting high-level layout requirements. In Khmer, one such requirement is support for line-breaking and word-breaking. This task poses a significant technological challenge because the writing system does not use delimiters such as blank spaces or symbols to indicate boundaries between words. Spaces are instead used to mark the end of a sentence or phrase. This feature of the script broadly affects text processing.
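The difference between how characters are stored and how they are displayed can be made concrete with a short Python sketch. It lists the code points of one Khmer word in stored order and counts the bytes needed to encode it in UTF-8. The sample word (the name "Khmer" in Khmer script) and the helper name `describe` are illustrative choices, not taken from the text above.

```python
import unicodedata

# The word "Khmer" in Khmer script, written as escapes so that the
# stored (logical) order is explicit. Illustrative sample only.
word = "\u1781\u17d2\u1798\u17c2\u179a"

def describe(text):
    """Return (code point, Unicode name) pairs in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

rows = describe(word)
for code_point, name in rows:
    print(code_point, name)

# Each code point in the Khmer block takes three bytes in UTF-8.
utf8_bytes = word.encode("utf-8")
print(len(word), "code points,", len(utf8_bytes), "bytes in UTF-8")
```

The vowel sign at U+17C2 is stored fourth, yet it is drawn to the left of the consonant cluster that precedes it in memory, and the consonant after the COENG sign is drawn as a subscript. That reordering and stacking is the work of the shaping engine and the font, not of the encoding.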
Moreover, whilst word identification may pose no challenge for the native Khmer reader, creating algorithms that can accurately identify words and display text as intended is highly complex. Line-breaking is directly intertwined with the ability to detect word boundaries. When line breaks must respect the maximum width available, the expectation is that they occur at word boundaries. If this expectation is disregarded, readability and legibility are severely compromised: the inability to discern one word from another creates a chain of typographic obstructions caused by illegal or inadequate breaks.

Lexical ambiguity, such as a sentence that can be divided in multiple ways, is the source of the complexity in calculating the optimal line break without word boundary indicators. Syllable formation often causes this ambiguity, as the absence of delimiters between words can lead to an interpretation that is technically valid but unintended, for example ទារក (infant), ទា | រក (duck/find) or ទារ | ក (hungry/neck) (Tran Van Nam, Nguyen Thi Hue and Phan Huy Khanh, 2017).

Subword breaking is a concept developed to provide a middle ground between breaking at words and breaking at characters (Khanna, C. 2021). As discussed, an optimal line break is expected to occur at a word boundary. However, in Khmer, different rules apply to breaking within a word. Line breaks inside words, i.e. hyphenation, follow the strict principle that they must only occur between syllables (Haralambous, Y. 1993). This rule implies that an algorithm capable of detecting syllables is necessary for intra-word breaking, and “such an algorithm is beyond the scope of Unicode.” (Unicode, 2022).
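The segmentation ambiguity described above can be reproduced with a small dictionary-based sketch in Python. The lexicon is hypothetical and contains only the readings quoted in the text; a real word breaker would need a full dictionary plus some way of ranking the candidates. The code points are written as escapes to make the stored characters unambiguous.

```python
# Hypothetical mini-lexicon limited to the readings quoted in the text.
LEXICON = {
    "\u1791\u17b6\u179a\u1780",  # ទារក  'infant'
    "\u1791\u17b6",              # ទា    'duck'
    "\u179a\u1780",              # រក    'find'
    "\u1791\u17b6\u179a",        # ទារ   glossed 'hungry' in the text
    "\u1780",                    # ក     'neck'
}

def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon entries."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and segment whatever is left.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

TEXT = "\u1791\u17b6\u179a\u1780"  # ទារក
candidates = segmentations(TEXT, LEXICON)
for words in candidates:
    print(" | ".join(words))
```

With this lexicon the function finds three candidate segmentations for the same four code points. All of them are valid dictionary splits, so choosing the intended one requires context that a plain lookup cannot supply.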
Breaking text at subword syllables is algorithmically challenging “because syllable-final consonants are indistinguishable from consonants with an inherent vowel that constitutes a new syllable. Some kind of morphological analysis is needed” (Ishida, R. 2022).

According to Danh Hong, a Khmer font designer, Khmer words can be classified into three categories:

A single word: ជាតិ ‘national’, វិទ្យាល័យ ‘highschool’, កម្ម ‘action’
An affix word: prefix: អន្តរជាតិ ‘international’, មហវិទ្យាល័យ ‘high school’; suffix: កម្មករ ‘workers’
A compound word: ជាតិសាសន៍ ‘ethnicity’, កម្មផល ‘karma’, សកលវិទ្យាល័យ ‘university’

He also states that hyphenation is only theoretically possible in the context of a compound word. For example, ជាតិសាសន៍ ‘ethnicity’ is broken into |ជាតិ|សាសន៍|, កម្មផល ‘karma’ into |កម្ម|ផល|, and សកលវិទ្យាល័យ ‘university’ into |សកល|វិទ្យាល័យ| (GitHub, n.d).
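One way to act on this classification is to mark the boundary between the components of a compound word as a permitted break point. The Python sketch below is an assumption-laden illustration: it uses a hypothetical lookup table built from the three compounds above, and it marks the boundary with ZERO WIDTH SPACE (U+200B), an invisible character that signals a line-break opportunity. The names `COMPONENTS`, `COMPOUNDS` and `add_break_opportunities` are invented for the example, and the technique is not attributed to Danh Hong.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible line-break opportunity

# Hypothetical table of compound words, listed by their components
# as given in the examples above.
COMPONENTS = [
    ["ជាតិ", "សាសន៍"],       # 'ethnicity'
    ["កម្ម", "ផល"],          # 'karma'
    ["សកល", "វិទ្យាល័យ"],    # 'university'
]

# Map each full compound to its components.
COMPOUNDS = {"".join(parts): parts for parts in COMPONENTS}

def add_break_opportunities(word):
    """Insert ZWSP between the components of a known compound word.

    Words that are not in the table are returned unchanged, so single
    and affix words never receive an internal break point.
    """
    parts = COMPOUNDS.get(word)
    return ZWSP.join(parts) if parts else word

for compound in COMPOUNDS:
    marked = add_break_opportunities(compound)
    # Show the hidden break points as '|' for inspection.
    print(marked.replace(ZWSP, "|"))
```

A table lookup keeps the sketch honest about its limits: it breaks only where the component boundary is already known and makes no attempt at the syllable detection or morphological analysis that the quoted sources describe as difficult.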