Polytonic Greek Unicode Still Isn’t Perfect

Whether we’re talking about fonts, programming languages, keyboard entry or even the command line, support for polytonic Greek has greatly improved in the last 10 years, let alone the 23 years I’ve been doing computational analysis of Greek texts.

UPDATE (2016-12-04): The Skolar examples in this post will no longer make sense as the issues have now been fixed. See Diacritic Stacking in Skolar PE Fixed.

With configurable input sources in OS X, it’s easy to type polytonic Greek and the default fonts support all the Unicode codepoints for polytonic Greek. I can now just type Greek (rather than a transliteration or BetaCode) in data files or forum posts or emails or tweets or GitHub issues. There are still some display issues with using polytonic Greek in fixed-width fonts but that’s improving. Last year I talked about the bug I reported that got fixed in the Atom editor.

Python has long supported Unicode and Python 3 made it even easier to deal with text processing of Unicode files. It doesn’t sort polytonic Greek correctly out of the box, but I wrote pyuca to solve that problem!

The situation seemed almost perfect until I started doing a lot more work that required me to track vowel length and, in particular, use a macron ˉ to distinguish long α, ι, and υ from short. It’s okay when the macron is the only diacritic on a vowel: the problems start when a vowel has both an acute and a macron. (There is no need for a macron and a circumflex as the circumflex already implies the vowel is long. Same with an iota subscript.)

Problem 1: No precomposed character code points

ᾱ can be written as the decomposed U+03B1 U+0304 or the precomposed U+1FB1:

>>> import unicodedata
>>> len('ᾱ')
1
>>> [hex(ord(ch)) for ch in 'ᾱ']
['0x1fb1']    
>>> [unicodedata.name(ch) for ch in 'ᾱ']
['GREEK SMALL LETTER ALPHA WITH MACRON']
>>> unicodedata.decomposition('ᾱ')
'03B1 0304'    

ά can be written as the decomposed U+03B1 U+0301 or the precomposed U+03AC (assuming normalization to a tonos which the Greek Polytonic Input Source on OS X does):

>>> len('ά')
1
>>> [hex(ord(ch)) for ch in 'ά']
['0x3ac']
>>> [unicodedata.name(ch) for ch in 'ά']
['GREEK SMALL LETTER ALPHA WITH TONOS']
>>> unicodedata.decomposition('ά')
'03B1 0301'
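
The tonos caveat matters because polytonic Greek also has a separate "oxia" code point for the same letter. Here is a small sketch of how the two relate under normalization. The variable names are just for illustration.

```python
import unicodedata

oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS

# The two code points are different strings as far as Python is concerned...
print(oxia == tonos)

# ...but NFC maps the oxia form to the tonos form,
# and both decompose to alpha followed by COMBINING ACUTE ACCENT.
print(unicodedata.normalize("NFC", oxia) == tonos)
print(unicodedata.normalize("NFD", oxia) == unicodedata.normalize("NFD", tonos))
```

So if your data mixes the two, normalizing everything to NFC (or NFD) before comparing should make them agree.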

But there’s no precomposed character ᾱ́:

>>> len('ᾱ́')
2
>>> [hex(ord(ch)) for ch in 'ᾱ́']
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', 'ᾱ́')]
['0x1fb1', '0x301']    

As you can see, even Python 3 views ᾱ́ as two characters. This also screws up font metrics in many text editors and browser text areas (like the one I’m writing this post in).

Problem 2: Many fonts with otherwise excellent polytonic Greek support don’t display it properly

The Skolar PE font I use on this site can’t properly display ᾱ́. It displays it as ᾱ́. Ironically this is one time the fixed width fonts do a better job!

Problem 3: You can’t normalize an alternative ordering of diacritics

If you already have a GREEK SMALL LETTER ALPHA WITH TONOS and you add a COMBINING MACRON you end up (at least in the fonts I’ve tried) with something that even visually looks different from the GREEK SMALL LETTER ALPHA WITH MACRON followed by COMBINING ACUTE ACCENT:

>>> "\u03ac\u0304"
'ά̄'

(Notice that ά̄ != ᾱ́ and oddly, Skolar PE does a better job of the former than the latter: ά̄ vs ᾱ́)

And to make matters worse, you can’t normalize one to the other:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03ac\u0304')]
['0x3ac', '0x304']

You have to combine the components in the correct order with the macron FIRST:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0304\u0301')]
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0301\u0304')]
['0x3ac', '0x304']

This is not a bug: technically ά̄ and ᾱ́ are distinct graphemes (U+0301 and U+0304 share canonical combining class 230, so normalization preserves their relative order). It’s still an annoyance, though, because any code that adds diacritics needs to know the correct order in which to add them.
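
One possible workaround is to reorder the marks yourself before composing. This is a minimal sketch, assuming macron-first is the order you want. `macron_first` is a name made up for this example and is not part of any library. Note that it deliberately produces a string that is not canonically equivalent to its input.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON

def macron_first(text):
    """Move any combining macron ahead of the other marks on the same base.

    Hypothetical helper: decomposes to NFD, reorders, then recomposes to NFC.
    """
    out = []
    marks = []

    def flush():
        # Stable sort: macron first, other marks keep their relative order.
        marks.sort(key=lambda m: 0 if m == MACRON else 1)
        out.extend(marks)
        marks.clear()

    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            marks.append(ch)   # combining mark: hold until the base is complete
        else:
            flush()            # new base character: emit the previous marks
            out.append(ch)
    flush()
    return unicodedata.normalize("NFC", "".join(out))

# Alpha with tonos followed by a combining macron...
fixed = macron_first("\u03ac\u0304")
# ...becomes alpha with macron followed by a combining acute.
print([hex(ord(ch)) for ch in fixed])
```

This only handles the macron. Any other ordering rules (breathing versus accent, for instance) would need their own priorities.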

Problem 4: No support in the Greek Polytonic Input Source

The Greek Polytonic Input Source supports typing a digraph (diacritic then base) to produce precomposed characters but you can’t use a trigraph to enter ᾱ́. In fact, every time I’ve needed to type ᾱ́ in this post, I’ve had to copy and paste it from an earlier usage (and I manually minted one via Python the first time).

Problem 5: My existing syllabification heuristics didn’t work

I recently had to tweak the syllabification heuristics in my greek-accentuation Python library to correctly syllabify words like φῡ́ω. Prior to 0.9.4, it put a syllable division between the macron and the acute!

This would not have happened if Unicode (and hence Python) treated ῡ́ as a single character.
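
Code that wants to treat a base letter and its diacritics as one unit has to do the grouping itself. Here is a rough sketch of that idea using only the standard library. The `clusters` function is hypothetical and is not how greek-accentuation does it. It is also only an approximation of real grapheme clusters, which follow more rules than this.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() is nonzero for combining marks
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

word = "\u03c6\u1fe1\u0301\u03c9"  # phi, upsilon with macron, combining acute, omega

print(len(word))            # counts code points
print(len(clusters(word)))  # counts base-plus-marks units
```

Iterating over `clusters(word)` instead of `word` keeps the acute attached to its vowel, so a syllable boundary can never land between the macron and the accent.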

Problem 6: There’s also breathing

I thought I was all set after fixing Problem 5 but then I hit the imperfect of ἵστημι which starts in most cases with ῑ́̔/ῑ̔́ (yes, that should be a rough breathing and acute with a macron.) I’m in the process of working around this problem in greek-accentuation now.

The Solution

The root cause of all this is just that Unicode-based code can’t treat ῑ́̔ or ῡ́ or ᾱ́ as single characters because Unicode doesn’t have a codepoint for the precomposed characters. I imagine it’s a long road to get the Unicode Consortium to “fix” this, if it’s even possible. And even if some future version of Unicode fixed it, I’d have to wait for Python and OS X to catch up before the problem really goes away. For now I’ll just have to continue to work around the problem in code like my greek-accentuation library. That still doesn’t solve the problem with the Skolar PE fonts but I might be able to raise that issue with the font foundry.

It’s possible there are additional workarounds or tricks I’m not aware of. If there are, please let me know.

CORRECTION: Thanks to Tom Gewecke for pointing out an earlier misstatement about the Polytonic Greek Input Source on OS X producing combining characters. It does not. It supports digraphs to produce precomposed characters.

CORRECTION: Thanks to Martin J. Dürst for pointing out that ά̄ and ᾱ́ are distinct graphemes and so the fact they aren’t normalized to each other isn’t a problem with Unicode as such.

UPDATE: I remarked at the end of Problem 1 about font metrics in editors / text areas but really I should make that a separate problem. Related (and perhaps yet another problem) is selecting characters with multiple diacritics.

Updated Solution

Now see my later post: An Updated Solution to Polytonic Greek Unicode’s Problems.