Git and unreferenced blobs and Stack Overflow

December 16, 2009

I ran into a problem yesterday while trying to reduce the size of the Bixo repo at GitHub (450MB, ouch).

I deleted the release branch, first on GitHub and then locally; it was the branch that contained a bunch of big binary blobs (release distribution jars). But even after this work, my local and GitHub repositories were still 250MB+.

Following some useful steps I found on Stack Overflow, I narrowed the problem down to a few unreferenced blobs. By “unreferenced” I mean blobs with SHA1s that could not be located anywhere in the git tree/history by the various scripts I found on Stack Overflow.
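
For reference, these are roughly the kinds of commands involved (a sketch, not the exact scripts from Stack Overflow; flags may vary by git version):

git count-objects -v               # how much space the object database is using
git fsck --unreachable             # list unreachable (dangling) objects
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n | tail -10
                                   # the ten largest objects in the packfile
git gc --aggressive --prune=now    # repack and drop unreachable objects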

I posted a question about this to Stack Overflow, and got some very useful answers, though nothing that directly solved the problem. But it turns out that a fresh clone from GitHub is much smaller, and these dangling blobs are gone. So I think it’s a git bug, where these blobs get left around locally but are correctly cleared from the remote repo.

But I ran into a new problem today with my local Bixo repo, where I couldn’t push changes. I’d get this output from my “git push” command:

Counting objects: 92, done.
Delta compression using 2 threads.
Compressing objects: 100% (53/53), done.
Writing objects: 100% (57/57), 11.50 KiB, done.
Total 57 (delta 28), reused 0 (delta 0)
error: insufficient permission for adding an object to repository database ./objects

fatal: failed to write object
error: unpack-objects exited with error code 128
error: unpack failed: unpack-objects abnormal exit
To git@github.com:bixo/bixo.git
 ! [remote rejected] master -> master (n/a (unpacker error))
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

Searching turned up no solutions, but the problem doesn’t exist in a fresh clone, so I’m manually migrating my changes over to the fresh copy. Then I’ll happily delete my apparently messed-up older local git repo and move on to more productive uses of my time.
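
One way to script that migration (a sketch, assuming the local changes are unpushed commits on master; directory names illustrative):

git clone git@github.com:bixo/bixo.git bixo-fresh
cd bixo && git format-patch origin/master    # one patch file per unpushed commit
cd ../bixo-fresh && git am ../bixo/*.patch   # replay them onto the fresh clone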

[UPDATE: The problem does actually exist in a fresh clone, but only on the second push. Eventually GitHub support resolved the issue by fixing the permissions of some files on their side of the fence. Apparently things got “messed up” during the fork from the original EMI/bixo repo.]
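
If you hit the same “insufficient permission for adding an object” error on a remote you control (rather than GitHub), the usual cause is object files in the repository database owned by the wrong user or group. A sketch of the standard fix, run inside the bare repository on the server (group name illustrative):

sudo chgrp -R gitusers objects            # give the shared group ownership
sudo chmod -R g+ws objects                # group-writable, setgid on directories
git config core.sharedRepository group    # keep future objects group-writable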


Fixing Firefox default monitor

October 26, 2009

I’m running Firefox 3.0.14 on Mac OS X 10.5.

I’ve got a MacBook laptop and a 24″ LCD display as my normal configuration, though sometimes I’m just using the laptop.

Whenever I open a new browser window, it defaults to the laptop display, not the big LCD, even though that’s my main screen.

I searched the forums, and didn’t find any good solution, so here’s what worked for me:

  1. Quit Firefox.
  2. Locate the localstore.rdf file in your Firefox profile directory. This will be in the ~/Library/Application Support/Firefox/Profiles/<random string>.default/ directory.
  3. Open it with your favorite text editor.
  4. Find the RDF section with the description set to “chrome://browser/content/browser.xul#main-window”
  5. Set the screenX and screenY values to 0.
  6. Save the file.
  7. Restart Firefox.

In my case, for example, the prior contents of this file were:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="-1273"
    screenY="401"
    width="1276"
    sizemode="maximized" />

By setting screenX="0" and screenY="0", I was able to fix my problem.
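
For reference, the edited entry just zeroes out those two attributes and leaves everything else alone:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="0"
    screenY="0"
    width="1276"
    sizemode="maximized" />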


Performance problems with vertical/focused web crawling

May 19, 2009

Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:

Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.

There’s also a Jira issue (NUTCH-721) filed on the problem.

But in my experience using Nutch for vertical/focused crawls, very slow fetch performance at the end of a crawl is a fundamental problem caused by not having enough unique domains. Once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), fetch performance always drops rapidly, at least if your crawler properly obeys robots.txt and the default rules for polite crawling.

Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was running with Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) was an 11-server cluster (1 master, 10 slaves) using small EC2 instances. I ran 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum was 10 × 2 × 200 = 4,000 active fetch threads, which is way more than I needed, but I was also testing the memory usage (primarily kernel memory) of all those threads, so I’d cranked it way up.

I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “pay-level domain” approach described in the IRLbot paper. So http://www.ibm.com, blogs.us.ibm.com, and ibm.com all count as the same domain.

Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…

[Graph: Fetch Performance]

The key things to note from this graph are:

  • The 41K unique domains were down to 1,700 after an hour, and then slowly continued to drop. This directly limits the number of fetches that can politely execute at the same time; in fact there were only 240 parallel fetches (i.e. 240 active domains) after an hour, and 64 after three hours.
  • Correspondingly, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
  • And so it does, going from almost 9K/second (scaled to 10ths of a second in the graph) after one hour down to 7K/second after four hours.

I think this represents a typical vertical/focused crawl, where a graph of the number of URLs per domain would show very strong exponential decay. Once you’ve fetched the single URLs you have for a lot of different domains, you’re left with lots of URLs spread across a much smaller number of domains. And your performance will begin to stink.

The solution I’m using in Bixo is to specify a target fetch duration. From that, I can estimate the number of URLs per domain I might be able to get, and pre-prune the URLs put into each domain’s fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is one piece in a data processing pipeline and has to play well with the others (i.e. not stall the entire process).
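
The arithmetic behind the pre-pruning is simple (numbers here are illustrative, not Bixo’s actual defaults): if the polite crawl delay for a domain is d seconds, then a target fetch duration of T seconds means at most T/d URLs can be fetched from that domain, and any extra URLs in its queue are wasted effort. For example, a four-hour target with a 30-second delay caps each domain’s queue at 14400/30 = 480 URLs.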

Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.


The case of the curious power drain

October 4, 2008

I took my MacBook to the Nevada City 4th of July parade, to save some spots at Wisdom Cafe. This is your classic small-town event, with fire engines, jazzercisers and The German American Friendship Club in full lederhosen.

After about an hour of work, my battery life was down to less than an hour, instead of the 3-4 hours I was expecting. Hmm, maybe my battery was dying. But I could hear a faint humming noise, which meant the hard disk was spinning continuously. And I’d had trouble putting my Mac to sleep before I left for the parade.

I fired up Activity Monitor, and saw that something called “ditto” was using up 95% of my CPU. Over in the Terminal, I executed “ps auxwww | grep ditto” and got:

kenkrugler 15193  98.2  0.0    75624   1024   ??  Rs   Tue02PM 3942:21.74 /usr/bin/ditto -xk - /Users/kenkrugler/Desktop/.BAHO26HV

Executing “man ditto” in the Terminal tells me it’s used to “copy directory hierarchies, create and extract archives”. What’s odd is that the TIME column shows it had accumulated 3942 minutes of CPU time, or almost 3 days.

Then I remembered that a few days ago, I’d renamed a .jar file to .zip and tried to expand it, to show somebody the structure of a jar file. But the Archive Utility hung, or rather it refused to finish or quit. So I had to force-quit it.

Which apparently left the underlying “ditto” process running, and what looks like a fully expanded version of the .zip file in a hidden “.BAHO26HV” directory on my desktop. I restarted my Mac, deleted the directory, and everything returned to normal (less 3 hours of battery).
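
In hindsight, the restart probably wasn’t necessary; killing the runaway process by the PID from the ps output above should have been enough:

kill 15193       # ask the ditto process to terminate
kill -9 15193    # if it ignores that, force it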


Mark Twain Commentary on the MacBook Air

April 5, 2008

One of my favorite posts to the Unicode mailing list came during a heated debate about “simplifying” certain character sets. I believe it was Joe Becker who re-posted Mark Twain’s humorous proposal for simplifying English spelling:

[Image: Mark Twain]

In year 1, that useless letter “c” would be dropped to be replased either by “k” or “s;” and likewise, “x” would no longer be part of the alphabet. The only kase in which “c” would be retained would be the “ch” formation, which will be dealt with later. Year 2 might reform the “w” spelling so that “which” and “one” would take the same konsonant wile year 3 might well abolish “y,” replasing it with “i;” and iear 4 might fiks the “g/j” anomali wonse and for all.

Jeneraly, then, the improvement would kontinue iear bai iear with iear 5 doing awai with useless double konsonants; and iears 6-12 or so modifaiiing vowlz and the rimeining voist and unvoist kononants. Bai iear 15, it wud fainali bi posibl tu meik ius ov thi ridandant letez “c,” “y,” and “x” — bai now jast a memori in the maindz ov ould doderez — tu replais “ch,” “sh,” and “th,” rispektivli.

Finali, xen, ater sam 20 iers ov orxogrefkl riform, wi wud hev a lojikl, kohirnt speling in ius xrewawt xe Ingliy-Spiking werld.

I wonder what Mark Twain would have to say about the MacBook Air…

[Image: MacBook Air]

A Plan for the Improvement of the PowerBook G4 12″ Laptop

For example, that useless Ethernet port would be dropped to be replaced by wireless only, and likewise the second USB port and FireWire would no longer be available. The hard disk will be retained, but only in the very slow iPod version or the very expensive flash version, as most people will no longer need to keep files other than system software on their computer.

Continuing our optimization, the DVD/CD-ROM drive is now useless, because there’s no space on the hard disk to install anything. And with the slower processor, the option to expand memory to 4GB is also unneeded as who would now do any heavy processing work with this computer?

Once the device has been tuned for email-writing executives, the price can be increased to match their signature authority, thus eliminating the problems caused by most other customers buying the product and complaining about its limitations.

Finally, then, after extensive optimizations, we would have the perfect computer for our target customer, Steve Jobs.


Unicode and Excellence in Technical Research

March 21, 2008

While digging around in my musty Unicode mailing list archives, I came across a true classic.

For those of you who wonder what a quality technical mailing list post looks like, read Ken Whistler’s essay below on the “High Ogonek” character.

Side note – it’s of particular interest to me because I wound up doing the same kind of forensic character set research as part of my work on internationalizing the Mac OS. For me, the letter in question was the mythical “Y with diaeresis”, which had been faithfully ported to the Macintosh “Roman” character set from the Lisa character set.

But nobody really knew what language used it. Rumor in the hallways was that it somehow came into the Lisa character set from a Turkish character set. In the end there wasn’t sufficient information to declare it null and void, so we left it as-is.

Now you can actually do a Google search on ÿ, and find a Wikipedia article that references its use in Greek transcription and in rare French place names like “L’Haÿ-les-Roses”, but nothing about Turkish.

On to Ken’s Opus, from 4 April 1991:

Warning to readers: This contribution contains real research, so if you haven’t got time to care, you can delete it now!

The “High Ogonek” has stuck in my craw for so long that I feel I must say something about it. The High Ogonek is symptomatic of one of the things wrong about the character standardization business, which encourages the blithe perpetuation of mistaken “characters” from standard to standard, like code viruses. At least, in the past, the epidemic was constrained by the fact that the encoding bodies only had 256 cells which could get infected by such abominations as half-integral signs. Now, however, with Unicode and ISO 10646 and the AFII registry, and other 2 byte corporate standards, the number of cells available for infection is vast, and the temptation to encode everybody else’s junk just seems to have become irresistible.

WHENCE HIGH OGONEK?

“High Ogonek” can be found in ISO DIS 10646 (JTC1/SC2/WG2 N666) at 034/126. What is it? Well, that’s a good question, and 10646 doesn’t provide a clue–but then it doesn’t say anything about where any of its content comes from. But for those in the know, the source of “High Ogonek” in the DIS 10646 can be tracked to ECMA/TC1/90/15, Latin Alphabet No. 6, and more specifically to Appendix A, which reproduces 34 characters “registered according to ISO 2375 as Registration No. 158”, for “text in the Skolt Lappish dialect, as well as texts using older Lappish orthography…” Position 03/00 in the code table of Registration No. 158 is our critter. So now we know what it is, right? Wrong. The ill-defined squiggle in position 03/00 does indeed look something like an ogonek (mistaken ogonek forms are themselves another tale of woe I won’t get into here), and the “ogonek” in 03/00 is indeed high in its box–hence the “High Ogonek” in DIS 10646, drawn in position 034/126 as a nondescript rightward hook.

Well, reviewers of 10646 have complained about “High Ogonek”, and something has indeed been done. In JTC1/SC2/WG2 N680 “Updated code table charts”, dated 22 March 1991, the “High Ogonek” has now been printed using a high reversed comma, quite sharply distinguished from the “Ogonek” at 033/178. In fact, it looks remarkably like an aspiration mark–hmmm. For those of you with long memories or big filing cabinets, the 2nd DP of 10646 had just such a thing at 171/072, labeled “IPA ASPIRATION MARK”, but all the IPA later disappeared in the DIS, just as the strange “High Ogonek” appeared.

N680 was “generated by AFII using their publishing system,” so it would behoove us to check whether the “High Ogonek” virus has spread to AFII–and guess what! The draft AFII registry has a new glyph id 043B/241B devoted especially to printing the 10646 “High Ogonek”. The AFII glyph looks like a high reversed comma, and is labeled:

“High ogonek” (not a non-spacing character, but rather a separate character within words) (Lapp)

That’s strange, because AFII has what appears to be the same glyph encoded at 342B/110B, labeled:

Aspirated, IPA

So AFII and 10646 seem to have decided these things are different. Welcome to the “High ogonek”.

What about Unicode? I don’t think I would be telling any tales out of school if I revealed that Unicode almost got a “High ogonek”, too, since Unicode was busy incorporating all the 10646 mistakes in Unicode while 10646 was busy incorporating all the Unicode mistakes in 10646. (Gives you an Excedrin headache, doesn’t it?) But some degree of reason has prevailed, and the Skolt Lappish “High Ogonek” is now simply mapped to Unicode U+02BD MODIFIER LETTER REVERSED COMMA (which is explicitly intended as the IPA aspiration mark).

Is that the right answer? Well, how about doing what should have been done in the first place–some research–instead of just citing other character standards like holy books.

TRANSCRIPTION OF ASPIRATION IN LAPPISH

Based on a fairly quick survey, I note three broad groups of treatment of Lappish transcription:

1. Prewar (pre World War II) publications using systems based on Finno-Ugrian practice (which itself is an offshoot of the transcription used by Indo-Europeanists). Non-phonemic, non-systematic phonetic, and inconsistently narrow transcription.

2. Early postwar publications. Systematic phonemic, but with a nod to old-fashioned transcription and IPA usages.

3. “Modern” publications (70’s and 80’s). Phonemic, with systematic phonetic realization rules, and with tuned practical orthographies. (E.g. “sj” for esh, rather than s-acute or s-hacek, etc.)

Going from best to worst, i.e. recent to early, we have the following facts.

In modern treatments, aspiration is not part of Lappish orthography. Why? I’ll let the best analyst explain it:

Die Verschlusslaute werden in phonetischer Hinsicht entweder als mehr oder weniger stimmhafte Lenes [b d g] oder als stimmlose Fortes realisiert. Die letzteren ko”nnen entweder unaspiriert [p t k], pra”aspiriert [hp(p) ht(t) hk(k)] oder postaspiriert [ph th kh] ausgesprochen werden.

[Translation: Phonetically, the stops are realized either as more or less voiced lenes [b d g] or as voiceless fortes. The latter can be pronounced either unaspirated [p t k], preaspirated [hp(p) ht(t) hk(k)], or postaspirated [ph th kh].]

(Su”dlappisches Wo”rterbuch, Gustav Hasselbrink, Uppsala 1981, Ab Lundequistska Bokhandeln, p. 42.) In other words (South) Lapp has a lenis and a fortis series of stops, and the fortis series may be either unaspirated, preaspirated (in geminate contexts) or postaspirated, depending on the context. Since degree of aspiration is predictable by context, it need not be represented in the orthography. However, when Hasselbrink wants to explicitly transcribe aspiration phonetically, he does so with an inline “h” or a raised “h”–the distinction being primarily whether phonological pattern or phonetic quality is in question.

G. M. Kert published a very similar analysis in Saamskii Yazyk, Leningrad 1971, Soviet Academy of Sciences. See, for example, the phonological chart on p. 63. (I won’t quote anything–Cyrillic in ASCII is too painful.)

The early postwar treatments of Lapp also use a standardized orthography for Lapp, with two stop series, but are sometimes hazier about the status of each series. They also tend to use the {raised reversed comma} to indicate aspiration explicitly. Examples are: Wo”rterbuch des Waldlappendialekts von Mala{ring} und Texte zur Ethnographie, Wolfgang Schlachter, Helsinki 1958, Suomalais- Ugrailainen Seura. Also: The Lappish Dialect of Jukkasjo”rvi, A Morphological Survey, Bjo”rn Collinder, Uppsala, 1949, Almqvist & Wiksells Boktryckeri Ab:

31. k, p, t are unaspirated (as c, p, t in French) if they are not followed by the sign [{raised reverse comma}] (see Section 59).
–p. 11

Then we get to the pre-phonemic transcriptions. These have no systematic understanding of phonological derivation and phonetic realization, and tend to have either broad or narrow “phonetic” orthographies, with symbols derived from Finno-Ugrian practice. Example 1: Lappisher Wortschatz, Eliel Lagercrantz, Helsinki, 1939, Suomalais-Ugrilainen Seura (2 vols.). This lexicon systematically transcribes aspiration, and does so with a {raised small cap h} after stop consonants.

Example 2 is a massive work, and represents the extreme of unsystematic narrow phonetic transcription: Lappisk Ordbok, Konrad Nielsen, Oslo 1962, Universitetsforlaget (5 vols.). Don’t let the date of publication fool you–the words were collected from 1906-1911, the compilation was begun in 1929, and the first signature was printed in 1930. Nielsen uses a plethora of diacritics for all kinds of things, since this is a cross-dialectal compilation. For explicit aspiration, he uses a {raised left half ring} (cf. Unicode U+02BF), which is a common Indo-European and/or Finno-Ugrian typographical substitute for the {raised reversed comma}. Since Nielsen also follows the Indo-European tradition of typesetting cited forms in italics, the {raised left half ring} also gets leaned over a bit and then is strongly kerned up over the “knee” of the “k”‘s or “h”‘s (yes!, aspirated “h”‘s), and nestles in above the cross-bar’s of the “t”‘s. So for the typesetter, these aspirated forms were probably a single piece of type, but the analysis clearly shows the {raised left half ring} to be, in principle, a “spacing” diacritic following a stop (or “h”).

My brief survey of these works did not turn up any specifically dealing with the “Skolt Lapp” dialect, but the general picture is clear. Aspirated phones do exist in Lappish dialects, and the aspiration has been traditionally transcribed using either a {raised reversed comma} or a typographical variant of that, the {raised left half ring}. The Skolt Lapp texts referred to in ECMA/TC1/90/15 presumably follow this orthographic tradition, influenced by Nielsen or other early analysts. Modern Lapp orthographies omit transcription of aspiration altogether. (Incidentally, Nielsen appears to be the source of the g-bar for transcribing a palatal voiced fricative in Lapp; modern analysts like Hasselbrink sensibly substitute a “j” for this sound. And as long as I am picking nits, Nielsen’s “g-bar” is actually a “g” with an underline strike-thru at the baseline, not the “g” with a short bar sticking out the side as shown in position 034/188 in 10646.)

WHITHER HIGH OGONEK

Into the nearest dumpster, I hope. We are dealing here with a perfectly normal manifestation of European transcription of aspiration–as manifested in thousands of transcriptions of hundreds of languages. There is nothing specifically Lapp about it, and it has absolutely nothing to do with the ogonek.


WikiTrans round two

November 7, 2007

Back in January 2006, I wrote a blog post on the Krugle site titled WikiTrans – translation by the community.

Since then, I haven’t been able to spend any time even thinking about this topic, but two more recent sightings popped it back into my mind.

First there was the October event at IMUG (International Mac Users Group), titled Collaborative Website Translation via the Worldwide Lexicon, by Brian McConnell. The short description is:

The Worldwide Lexicon Project is an open source, collaborative translation system for websites and publishers. The current version of WWL, available now for WordPress, Firefox and PHP based sites, enables a website’s readers, as well as volunteer or staff translators, to create, edit and share translations to and from almost any human language.

While this focuses on website translation and I’m more interested in application translation, the goals are similar: how to create an open, collaborative system for translating (web sites, applications) from English into other languages.

And ideally doing it in a way where the results can automatically be used by the owner(s) of the original version to create and publish localized versions without much extra effort, as otherwise it just won’t happen.

Second, I noticed on the Chandler mailing list that there’s a big push for localization of the current version. And that, in turn, led to discussion about how to manage these translations. I wish I had the cycles to help out here, as it would be a fascinating project to create this type of system that could just plug into the Chandler development process.