Unicode and Excellence in Technical Research

While digging around in my musty Unicode mailing list archives, I came across a true classic.

For those of you who wonder what a quality technical mailing list post looks like, read Ken Whistler’s essay below on the “High Ogonek” character.

Side note – it’s of particular interest to me because I wound up doing the same kind of forensic character set research as part of my work on internationalizing the Mac OS. For me, the letter in question was the mythical “Y with diaeresis”, which had been faithfully ported to the Macintosh “Roman” character set from the Lisa character set.

But nobody really knew what language used it. Rumor in the hallways was that it somehow came into the Lisa character set from a Turkish character set. In the end there wasn’t sufficient information to declare it null and void, so we left it as-is.

Now you can actually do a Google search on ÿ and find a Wikipedia article that references its use in Greek transcription and in rare French place names like “L’Haÿ-les-Roses”, but nothing about Turkish.
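As a quick aside for anyone who wants to see where ÿ lives today, here is a minimal sketch using Python's standard `unicodedata` module and its built-in `mac_roman` and `latin-1` codecs. It only illustrates the modern encodings and says nothing about the original Lisa table; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(ch):
    """Return the Unicode name and two single-byte encodings of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # Hex value of the byte the character maps to in each legacy encoding.
        "mac_roman": ch.encode("mac_roman").hex(),
        "latin_1": ch.encode("latin-1").hex(),
    }


# "\u00ff" is the small y with diaeresis discussed above.
info = describe("\u00ff")
for key, value in info.items():
    print(f"{key}: {value}")
```

The same character sits at different byte values in the two encodings, which is a small reminder of why this kind of character archaeology was needed in the first place.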

On to Ken’s Opus, from 4 April 1991:

Warning to readers: This contribution contains real research, so if you haven’t got time to care, you can delete it now!

The “High Ogonek” has stuck in my craw for so long that I feel I must say something about it. The High Ogonek is symptomatic of one of the things wrong about the character standardization business, which encourages the blithe perpetuation of mistaken “characters” from standard to standard, like code viruses. At least, in the past, the epidemic was constrained by the fact that the encoding bodies only had 256 cells which could get infected by such abominations as half-integral signs. Now, however, with Unicode and ISO 10646 and the AFII registry, and other 2 byte corporate standards, the number of cells available for infection is vast, and the temptation to encode everybody else’s junk just seems to have become irresistible.


“High Ogonek” can be found in ISO DIS 10646 (JTC1/SC2/WG2 N666) at 034/126. What is it? Well, that’s a good question, and 10646 doesn’t provide a clue–but then it doesn’t say anything about where any of its content comes from. But for those in the know, the source of “High Ogonek” in the DIS 10646 can be tracked to ECMA/TC1/90/15, Latin Alphabet No. 6, and more specifically to Appendix A, which reproduces 34 characters “registered according to ISO 2375 as Registration No. 158”, for “text in the Skolt Lappish dialect, as well as texts using older Lappish orthography…” Position 03/00 in the code table of Registration No. 158 is our critter. So now we know what it is, right? Wrong. The ill-defined squiggle in position 03/00 does indeed look something like an ogonek (mistaken ogonek forms are themselves another tale of woe I won’t get into here), and the “ogonek” in 03/00 is indeed high in its box–hence the “High Ogonek” in DIS 10646, drawn in position 034/126 as a nondescript rightward hook.

Well, reviewers of 10646 have complained about “High Ogonek”, and something has indeed been done. In JTC1/SC2/WG2 N680 “Updated code table charts”, dated 22 March 1991, the “High Ogonek” has now been printed using a high reversed comma, quite sharply distinguished from the “Ogonek” at 033/178. In fact, it looks remarkably like an aspiration mark–hmmm. For those of you with long memories or big filing cabinets, the 2nd DP of 10646 had just such a thing at 171/072, labeled “IPA ASPIRATION MARK”, but all the IPA later disappeared in the DIS, just as the strange “High Ogonek” appeared.

N680 was “generated by AFII using their publishing system,” so it would behoove us to check whether the “High Ogonek” virus has spread to AFII–and guess what! The draft AFII registry has a new glyph id 043B/241B devoted especially to printing the 10646 “High Ogonek”. The AFII glyph looks like a high reversed comma, and is labeled:

“High ogonek” (not a non-spacing character, but rather a separate character within words) (Lapp)

That’s strange, because AFII has what appears to be the same glyph encoded at 342B/110B, labeled:

Aspirated, IPA

So AFII and 10646 seem to have decided these things are different. Welcome to the “High ogonek”.

What about Unicode? I don’t think I would be telling any tales out of school if I revealed that Unicode almost got a “High ogonek”, too, since Unicode was busy incorporating all the 10646 mistakes in Unicode while 10646 was busy incorporating all the Unicode mistakes in 10646. (Gives you an Excedrin headache, doesn’t it?) But some degree of reason has prevailed, and the Skolt Lappish “High Ogonek” is now simply mapped to Unicode U+02BD MODIFIER LETTER REVERSED COMMA (which is explicitly intended as the IPA aspiration mark).

Is that the right answer? Well, how about doing what should have been done in the first place–some research–instead of just citing other character standards like holy books.


Based on a fairly quick survey, I note three broad groups of treatment of Lappish transcription:

1. Prewar (pre World War II) publications using systems based on Finno-Ugrian practice (which itself is an offshoot of the transcription used by Indo-Europeanists). Non-phonemic, non-systematic phonetic, and inconsistently narrow transcription.

2. Early postwar publications. Systematic phonemic, but with a nod to old-fashioned transcription and IPA usages.

3. “Modern” publications (70’s and 80’s). Phonemic, with systematic phonetic realization rules, and with tuned practical orthographies. (E.g. “sj” for esh, rather than s-acute or s-hacek, etc.)

Going from best to worst, i.e. recent to early, we have the following facts.

In modern treatments, aspiration is not part of Lappish orthography. Why? I’ll let the best analyst explain it:

Die Verschlusslaute werden in phonetischer Hinsicht entweder als mehr oder weniger stimmhafte Lenes [b d g] oder als stimmlose Fortes realisiert. Die letzteren können entweder unaspiriert [p t k], präaspiriert [hp(p) ht(t) hk(k)] oder postaspiriert [ph th kh] ausgesprochen werden.

(Südlappisches Wörterbuch, Gustav Hasselbrink, Uppsala 1981, Ab Lundequistska Bokhandeln, p. 42.) In other words (South) Lapp has a lenis and a fortis series of stops, and the fortis series may be either unaspirated, preaspirated (in geminate contexts) or postaspirated, depending on the context. Since degree of aspiration is predictable by context, it need not be represented in the orthography. However, when Hasselbrink wants to explicitly transcribe aspiration phonetically, he does so with an inline “h” or a raised “h”–the distinction being primarily whether phonological pattern or phonetic quality is in question.

G. M. Kert published a very similar analysis in Saamskii Yazyk, Leningrad 1971, Soviet Academy of Sciences. See, for example, the phonological chart on p. 63. (I won’t quote anything–Cyrillic in ASCII is too painful.)

The early postwar treatments of Lapp also use a standardized orthography for Lapp, with two stop series, but are sometimes hazier about the status of each series. They also tend to use the {raised reversed comma} to indicate aspiration explicitly. Examples are: Wörterbuch des Waldlappendialekts von Malå und Texte zur Ethnographie, Wolfgang Schlachter, Helsinki 1958, Suomalais-Ugrilainen Seura. Also: The Lappish Dialect of Jukkasjärvi, A Morphological Survey, Björn Collinder, Uppsala, 1949, Almqvist & Wiksells Boktryckeri Ab:

31. k, p, t are unaspirated (as c, p, t in French) if they are not followed by the sign [{raised reversed comma}] (see Section 59).
–p. 11

Then we get to the pre-phonemic transcriptions. These have no systematic understanding of phonological derivation and phonetic realization, and tend to have either broad or narrow “phonetic” orthographies, with symbols derived from Finno-Ugrian practice. Example 1: Lappischer Wortschatz, Eliel Lagercrantz, Helsinki, 1939, Suomalais-Ugrilainen Seura (2 vols.). This lexicon systematically transcribes aspiration, and does so with a {raised small cap h} after stop consonants.

Example 2 is a massive work, and represents the extreme of unsystematic narrow phonetic transcription: Lappisk Ordbok, Konrad Nielsen, Oslo 1962, Universitetsforlaget (5 vols.). Don’t let the date of publication fool you–the words were collected from 1906-1911, the compilation was begun in 1929, and the first signature was printed in 1930. Nielsen uses a plethora of diacritics for all kinds of things, since this is a cross-dialectal compilation. For explicit aspiration, he uses a {raised left half ring} (cf. Unicode U+02BF), which is a common Indo-European and/or Finno-Ugrian typographical substitute for the {raised reversed comma}. Since Nielsen also follows the Indo-European tradition of typesetting cited forms in italics, the {raised left half ring} also gets leaned over a bit and then is strongly kerned up over the “knee” of the “k”s or “h”s (yes!, aspirated “h”s), and nestles in above the cross-bars of the “t”s. So for the typesetter, these aspirated forms were probably a single piece of type, but the analysis clearly shows the {raised left half ring} to be, in principle, a “spacing” diacritic following a stop (or “h”).

My brief survey of these works did not turn up any specifically dealing with the “Skolt Lapp” dialect, but the general picture is clear. Aspirated phones do exist in Lappish dialects, and the aspiration has been traditionally transcribed using either a {raised reversed comma} or a typographical variant of that, the {raised left half ring}. The Skolt Lapp texts referred to in ECMA/TC1/90/15 presumably follow this orthographic tradition, influenced by Nielsen or other early analysts. Modern Lapp orthographies omit transcription of aspiration altogether. (Incidentally, Nielsen appears to be the source of the g-bar for transcribing a palatal voiced fricative in Lapp; modern analysts like Hasselbrink sensibly substitute a “j” for this sound. And as long as I am picking nits, Nielsen’s “g-bar” is actually a “g” with an underline strike-thru at the baseline, not the “g” with a short bar sticking out the side as shown in position 034/188 in 10646.)


Into the nearest dumpster, I hope. We are dealing here with a perfectly normal manifestation of European transcription of aspiration–as manifested in thousands of transcriptions of hundreds of languages. There is nothing specifically Lapp about it, and it has absolutely nothing to do with the ogonek.
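
For readers who want to check the characters Ken mentions, the sketch below looks them up with Python's standard `unicodedata` module. It covers U+02BD and U+02BF from the post; the two ogonek code points (U+02DB and U+0328) are my own addition for contrast, and the helper name `summarize` is hypothetical. The names and categories come from whatever Unicode database ships with your Python, not from the 1991 drafts.

```python
import unicodedata

# U+02BD and U+02BF are cited in the post; the two ogonek characters
# are added here only for comparison.
CODE_POINTS = [0x02BD, 0x02BF, 0x02DB, 0x0328]


def summarize(cp):
    """Return (code point label, character name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


for cp in CODE_POINTS:
    print(*summarize(cp), sep="  ")
```

Both aspiration marks come back with category `Lm` (modifier letter), meaning they are spacing characters that sit inline within a word. That is consistent with the AFII note about “a separate character within words”. The combining ogonek, by contrast, is a non-spacing mark (`Mn`).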

