The case of the curious power drain

October 4, 2008

I took my MacBook to the Nevada City 4th of July parade, to save some spots at Wisdom Cafe. This is your classic small-town event, with fire engines, jazzercisers and The German American Friendship Club in full lederhosen.

After about an hour of work, my battery life was down to less than an hour, instead of the 3-4 hours I was expecting. Hmm, maybe my battery was dying. But I could hear a faint humming noise, which meant the hard disk was spinning continuously. And I'd had trouble putting my Mac to sleep before I left for the parade.

I fired up Activity Monitor, and saw that something called “ditto” was using up 95% of my CPU. Firing up the Terminal, I executed “ps auxwww | grep ditto” and got:

kenkrugler 15193  98.2  0.0    75624   1024   ??  Rs   Tue02PM 3942:21.74 /usr/bin/ditto -xk - /Users/kenkrugler/Desktop/.BAHO26HV

Executing “man ditto” in the Terminal tells me it’s used to “copy directory hierarchies, create and extract archives”. What’s odd is that it has racked up 3942 minutes of CPU time, or almost 3 days.
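As a quick sanity check on that "almost 3 days" figure, here is a minimal Python sketch that pulls the TIME column out of the ps line above and converts it to days. It assumes the column layout shown in that output (USER, PID, %CPU, %MEM, VSZ, RSS, TT, STAT, STARTED, TIME, COMMAND) and a minutes:seconds TIME format; the function name is just something I made up for illustration.

```python
def cpu_time_to_days(time_field: str) -> float:
    """Convert a ps TIME value such as '3942:21.74' (minutes:seconds) to days."""
    minutes, seconds = time_field.split(":")
    total_seconds = int(minutes) * 60 + float(seconds)
    return total_seconds / 86400  # 86400 seconds in a day


# The line reported by "ps auxwww | grep ditto" in the post.
ps_line = (
    "kenkrugler 15193  98.2  0.0    75624   1024   ??  Rs   Tue02PM "
    "3942:21.74 /usr/bin/ditto -xk - /Users/kenkrugler/Desktop/.BAHO26HV"
)

# Split into at most 11 fields so the command (which contains spaces) stays whole.
fields = ps_line.split(None, 10)
time_field = fields[9]   # assumed position of the TIME column
command = fields[10]

days = cpu_time_to_days(time_field)
print(f"{command.split()[0]} has used {days:.2f} days of CPU time")
```

With the process pegged at roughly 98% CPU, that CPU time lines up with a process that had been running since Tuesday afternoon.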

Then I remembered that a few days ago, I’d renamed a .jar file to .zip and tried to expand it, to show somebody the structure of a jar file. But the Archive Utility hung, or rather it refused to finish or quit. So I had to force-quit it.

Which apparently left the underlying “ditto” process running, and what looks like a fully expanded version of the .zip file in a hidden “.BAHO26HV” directory on my desktop. I restarted my Mac, deleted the directory, and everything returned to normal (less 3 hours of battery).

Mark Twain Commentary on the MacBook Air

April 5, 2008

One of my favorite posts to the Unicode mailing list came during a heated debate about “simplifying” certain character sets. I believe it was Joe Becker who re-posted Mark Twain’s humorous proposal for simplifying English spelling:

Mark Twain

In year 1, that useless letter “c” would be dropped to be replased either by “k” or “s;” and likewise, “x” would no longer be part of the alphabet. The only kase in which “c” ould be retained would be the “ch” formation, which will be dealth with later. Year 2 might reform the “w” spelling so that “which” and “one” would take the same konsonant wile year 3 might well abolish “y,” replasing it with “i;” and iear 4 might fiks the “g/j” anomali wonse and for all.

Jeneraly, then the improvement would kontinue iear bai iear with iear 5 doing awai with useless double konsonants; and iears 6-12 or so modifaiiing vowlz and the rimeining voist and unvoist kononants. Bai iear 15, it wud be fainali bi posibl tu meik ius ov thi ridandant letez “c,” “y,” and “x” — bai now jast a memori in the maindz ov ould doderez — tu replais “ch,” “sh,” and “th,” rispektivli.

Finali, xen, ater sam 20 iers ov orxogrefkl riform, wi wud hev a lojikl, kohirnt speling iniuse xrewawt xe Ingliy-Spiking werld.

I wonder what Mark Twain would have to say about the MacBook Air…

MacBook Air

A Plan for the Improvement of the PowerBook G4 12″ Laptop

For example, that useless Ethernet port would be dropped to be replaced by wireless only, and likewise the second USB and FireWire would no longer be available. The hard disk will be retained, but only the very slow iPod version or the very expensive flash version, as most people will no longer need to keep files other than system software on their computer.

Continuing our optimization, the DVD/CD-ROM drive is now useless, because there’s no space on the hard disk to install anything. And with the slower processor, the option to expand memory to 4GB is also unneeded as who would now do any heavy processing work with this computer?

Once the device has been tuned for email writing executives, the price can be increased to match their signature authority, thus eliminating problems caused by most other customers buying the product and complaining about limitations.

Finally, then, after extensive optimizations, we would have the perfect computer for our target customer, Steve Jobs.

Unicode and Excellence in Technical Research

March 21, 2008

While digging around in my musty Unicode mailing list archives, I came across a true classic.

For those of you who wonder what a quality technical mailing list post looks like, read Ken Whistler’s essay below on the “High Ogonek” character.

Side note – it’s of particular interest to me because I wound up doing the same kind of forensic character set research as part of my work on internationalizing the Mac OS. For me, the letter in question was the mythical “Y with diaeresis”, which had been faithfully ported to the Macintosh “Roman” character set from the Lisa character set.

But nobody really knew what language used it. Rumor in the hallways was that it somehow came into the Lisa character set from a Turkish character set. In the end there wasn’t sufficient information to declare it null and void, so we left it as-is.

Now you can actually do a Google search on ÿ, and find a Wikipedia article that references its use in Greek transcription and rare French place names like “L’Haÿ-les-Roses”, but nothing about Turkish.
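For the curious, here is a small Python sketch showing where that character lives today. It relies only on the standard library's unicodedata module and the built-in "mac_roman" and "latin-1" codecs; the variable names are mine, and this is just an illustration, not anything from the original Mac OS work.

```python
import unicodedata

ch = "\u00ff"  # the "Y with diaeresis" discussed above

name = unicodedata.name(ch)
mac_roman_byte = ch.encode("mac_roman")  # position in the Macintosh Roman set
latin1_byte = ch.encode("latin-1")       # position in ISO 8859-1, for comparison

print(f"U+{ord(ch):04X} {name}")
print(f"Mac Roman: 0x{mac_roman_byte.hex().upper()}")
print(f"Latin-1:   0x{latin1_byte.hex().upper()}")
```

So the character survived the trip from the Macintosh Roman set into Unicode, whatever language it was originally meant for.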

On to Ken’s Opus, from 4 April 1991:

Warning to readers: This contribution contains real research, so if you haven’t got time to care, you can delete it now!

The “High Ogonek” has stuck in my craw for so long that I feel I must say something about it. The High Ogonek is symptomatic of one of the things wrong about the character standardization business, which encourages the blithe perpetuation of mistaken “characters” from standard to standard, like code viruses. At least, in the past, the epidemic was constrained by the fact that the encoding bodies only had 256 cells which could get infected by such abominations as half-integral signs. Now, however, with Unicode and ISO 10646 and the AFII registry, and other 2 byte corporate standards, the number of cells available for infection is vast, and the temptation to encode everybody else’s junk just seems to have become irresistible.


“High Ogonek” can be found in ISO DIS 10646 (JTC1/SC2/WG2 N666) at 034/126. What is it? Well, that’s a good question, and 10646 doesn’t provide a clue–but then it doesn’t say anything about where any of its content comes from. But for those in the know, the source of “High Ogonek” in the DIS 10646 can be tracked to ECMA/TC1/90/15, Latin Alphabet No. 6, and more specifically to Appendix A, which reproduces 34 characters “registered according to ISO 2375 as Registration No. 158”, for “text in the Skolt Lappish dialect, as well as texts using older Lappish orthography…” Position 03/00 in the code table of Registration No. 158 is our critter. So now we know what it is, right? Wrong. The ill-defined squiggle in position 03/00 does indeed look something like an ogonek (mistaken ogonek forms are themselves another tale of woe I won’t get into here), and the “ogonek” in 03/00 is indeed high in its box–hence the “High Ogonek” in DIS 10646, drawn in position 034/126 as a nondescript rightward hook.

Well, reviewers of 10646 have complained about “High Ogonek”, and something has indeed been done. In JTC1/SC2/WG2 N680 “Updated code table charts”, dated 22 March 1991, the “High Ogonek” has now been printed using a high reversed comma, quite sharply distinguished from the “Ogonek” at 033/178. In fact, it looks remarkably like an aspiration mark–hmmm. For those of you with long memories or big filing cabinets, the 2nd DP of 10646 had just such a thing at 171/072, labeled “IPA ASPIRATION MARK”, but all the IPA later disappeared in the DIS, just as the strange “High Ogonek” appeared.

N680 was “generated by AFII using their publishing system,” so it would behoove us to check whether the “High Ogonek” virus has spread to AFII–and guess what! The draft AFII registry has a new glyph id 043B/241B devoted especially to printing the 10646 “High Ogonek”. The AFII glyph looks like a high reversed comma, and is labeled:

“High ogonek” (not a non-spacing character, but rather a separate character within words) (Lapp)

That’s strange, because AFII has what appears to be the same glyph encoded at 342B/110B, labeled:

Aspirated, IPA

So AFII and 10646 seem to have decided these things are different. Welcome to the “High ogonek”.

What about Unicode? I don’t think I would be telling any tales out of school if I revealed that Unicode almost got a “High ogonek”, too, since Unicode was busy incorporating all the 10646 mistakes in Unicode while 10646 was busy incorporating all the Unicode mistakes in 10646. (Gives you an Excedrin headache, doesn’t it?) But some degree of reason has prevailed, and the Skolt Lappish “High Ogonek” is now simply mapped to Unicode U+02BD MODIFIER LETTER REVERSED COMMA (which is explicitly intended as the IPA aspiration mark).

Is that the right answer? Well, how about doing what should have been done in the first place–some research–instead of just citing other character standards like holy books.


Based on a fairly quick survey, I note three broad groups of treatment of Lappish transcription:

1. Prewar (pre World War II) publications using systems based on Finno-Ugrian practice (which itself is an offshoot of the transcription used by Indo-Europeanists). Non-phonemic, non-systematic phonetic, and inconsistently narrow transcription.

2. Early postwar publications. Systematic phonemic, but with a nod to old-fashioned transcription and IPA usages.

3. “Modern” publications (70’s and 80’s). Phonemic, with systematic phonetic realization rules, and with tuned practical orthographies. (E.g. “sj” for esh, rather than s-acute or s-hacek, etc.)

Going from best to worst, i.e. recent to early, we have the following facts.

In modern treatments, aspiration is not part of Lappish orthography. Why? I’ll let the best analyst explain it:

Die Verschlusslaute werden in phonetischer Hinsicht entweder als mehr oder weniger stimmhafte Lenes [b d g] oder als stimmlose Fortes realisiert. Die letzteren ko”nnen entweder unaspiriert [p t k], pra”aspiriert [hp(p) ht(t) hk(k)] oder postaspiriert [ph th kh] ausgesprochen werden.

(Su”dlappisches Wo”rterbuch, Gustav Hasselbrink, Uppsala 1981, Ab Lundequistska Bokhandeln, p. 42.) In other words (South) Lapp has a lenis and a fortis series of stops, and the fortis series may be either unaspirated, preaspirated (in geminate contexts) or postaspirated, depending on the context. Since degree of aspiration is predictable by context, it need not be represented in the orthography. However, when Hasselbrink wants to explicitly transcribe aspiration phonetically, he does so with an inline “h” or a raised “h”–the distinction being primarily whether phonological pattern or phonetic quality is in question.

G. M. Kert published a very similar analysis in Saamskii Yazyk, Leningrad 1971, Soviet Academy of Sciences. See, for example, the phonological chart on p. 63. (I won’t quote anything–Cyrillic in ASCII is too painful.)

The early postwar treatments of Lapp also use a standardized orthography for Lapp, with two stop series, but are sometimes hazier about the status of each series. They also tend to use the {raised reversed comma} to indicate aspiration explicitly. Examples are: Wo”rterbuch des Waldlappendialekts von Mala{ring} und Texte zur Ethnographie, Wolfgang Schlachter, Helsinki 1958, Suomalais-Ugrilainen Seura. Also: The Lappish Dialect of Jukkasjo”rvi, A Morphological Survey, Bjo”rn Collinder, Uppsala, 1949, Almqvist & Wiksells Boktryckeri Ab:

31. k, p, t are unaspirated (as c, p, t in French) if they are not followed by the sign [{raised reverse comma}] (see Section 59).
–p. 11

Then we get to the pre-phonemic transcriptions. These have no systematic understanding of phonological derivation and phonetic realization, and tend to have either broad or narrow “phonetic” orthographies, with symbols derived from Finno-Ugrian practice. Example 1: Lappisher Wortschatz, Eliel Lagercrantz, Helsinki, 1939, Suomalais-Ugrilainen Seura (2 vols.). This lexicon systematically transcribes aspiration, and does so with a {raised small cap h} after stop consonants.

Example 2 is a massive work, and represents the extreme of unsystematic narrow phonetic transcription: Lappisk Ordbok, Konrad Nielsen, Oslo 1962, Universitetsforlaget (5 vols.). Don’t let the date of publication fool you–the words were collected from 1906-1911, the compilation was begun in 1929, and the first signature was printed in 1930. Nielsen uses a plethora of diacritics for all kinds of things, since this is a cross-dialectal compilation. For explicit aspiration, he uses a {raised left half ring} (cf. Unicode U+02BF), which is a common Indo-European and/or Finno-Ugrian typographical substitute for the {raised reversed comma}. Since Nielsen also follows the Indo-European tradition of typesetting cited forms in italics, the {raised left half ring} also gets leaned over a bit and then is strongly kerned up over the “knee” of the “k”‘s or “h”‘s (yes!, aspirated “h”‘s), and nestles in above the cross-bar’s of the “t”‘s. So for the typesetter, these aspirated forms were probably a single piece of type, but the analysis clearly shows the {raised left half ring} to be, in principle, a “spacing” diacritic following a stop (or “h”).

My brief survey of these works did not turn up any specifically dealing with the “Skolt Lapp” dialect, but the general picture is clear. Aspirated phones do exist in Lappish dialects, and the aspiration has been traditionally transcribed using either a {raised reversed comma} or a typographical variant of that, the {raised left half ring}. The Skolt Lapp texts referred to in ECMA/TC1/90/15 presumably follow this orthographic tradition, influenced by Nielsen or other early analysts. Modern Lapp orthographies omit transcription of aspiration altogether. (Incidentally, Nielsen appears to be the source of the g-bar for transcribing a palatal voiced fricative in Lapp; modern analysts like Hasselbrink sensibly substitute a “j” for this sound. And as long as I am picking nits, Nielsen’s “g-bar” is actually a “g” with an underline strike-thru at the baseline, not the “g” with a short bar sticking out the side as shown in position 034/188 in 10646.)


Into the nearest dumpster, I hope. We are dealing here with a perfectly normal manifestation of European transcription of aspiration–as manifested in thousands of transcriptions of hundreds of languages. There is nothing specifically Lapp about it, and it has absolutely nothing to do with the ogonek.
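As a footnote to Ken's essay, the characters he mentions can be looked up in a current Unicode database. The sketch below uses Python's standard unicodedata module to print the code point, general category and name of U+02BD and U+02BF (both cited in the essay), alongside the two ogonek characters U+02DB and U+0328, which I've added purely for comparison. It only reports what today's character database says; it isn't a claim about how the 1991 drafts were laid out.

```python
import unicodedata

# U+02BD and U+02BF come from the essay; the two ogoneks are added for contrast.
candidates = ["\u02bd", "\u02bf", "\u02db", "\u0328"]

info = {}
for ch in candidates:
    info[ch] = (unicodedata.category(ch), unicodedata.name(ch))
    category, name = info[ch]
    print(f"U+{ord(ch):04X}  {category}  {name}")
```

The aspiration marks come out as spacing modifier letters (category Lm), which fits Ken's point that the mark is a "spacing" diacritic following a stop and has nothing to do with the ogonek.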

WikiTrans round two

November 7, 2007

Back in January 2006, I wrote a blog post on the Krugle site titled WikiTrans – translation by the community.

Since then, I haven’t been able to spend any time even thinking about this topic, but two more recent sightings popped it back up in my mind.

First there was the October event at IMUG (International Mac Users Group), titled Collaborative Website Translation via the Worldwide Lexicon, by Brian McConnell. The short description is:

The Worldwide Lexicon Project is an open source, collaborative translation system for websites and publishers. The current version of WWL, available now for Word Press, Firefox and PHP based sites, enables a website’s readers, as well as volunteer or staff translators, to create, edit and share translations to and from almost any human language.

This focuses on website translation, while I’m more interested in application translation, but the goals are similar – how to create an open, collaborative system for translating (web sites, applications) from English into other languages.

And ideally doing it in a way where the results can automatically be used by the owner(s) of the original version to create/publish localized versions without much extra effort. As otherwise it just won’t happen.

Second, I noticed on the Chandler mailing list that there’s a big push for localization of the current version. And that, in turn, led to discussion about how to manage these translations. I wish I had the cycles to help out here, as it would be a fascinating project to create this type of system that could just plug into the Chandler development process.

What’s a techno tidbit?

November 6, 2007

A long time ago, I worked at the Apple Japan office while a small team of us were trying to kick KanjiTalk (Japanese OS for the Mac) out the door. There was a significant gap between most of the office, which was focused on sales/marketing, and the development team. So I did a few talks on the technology behind KanjiTalk – Japanese text processing, input methods, fonts, etc.

I’m not sure how successful they were, but at least I can now recycle the name as the title for my blog.