Apple Passive-Aggressive Login

January 25, 2010

I logged into the Apple site recently, to make a Genius Bar appointment.

There seems to be some new information required, where they want the secret question/answer pair that now seems to be part of every company’s registration system. But after my login I got this interesting message:

Looks like a case of the right hand and the left hand not being in sync.


193MPH Volkswagen Van

January 5, 2010

While our EuroVan continues to provide for a local mechanic’s retirement fund, I found this article on “Tuners” included a much faster version of what we’re (hopefully) driving to the Grand Canyon during spring break.

The section on this vehicle from the article I linked to above says it all…

Winning first prize in the “You’ve got to be kidding me!” category was TH Automobile’s TH2 RS. What’s wacky about it? Well, what started life as a pedestrian Volkswagen T5 van has been made into The World’s Fastest Brick.

First, TH Automobile swapped the engine from the front to the rear. But instead of a VW unit, TH dropped in a Porsche twin-turbo flat-6 breathed on by 9ff to produce 800 bhp. The rear axle and 6-speed manual transmission come straight from Porsche, as do the brakes.

The interior was also completely remodeled, the driver’s position switched to a central location, along with four carbon-fiber racing buckets for passengers. To handle the TH2 RS’s aero-defying speed of 193.1 mph (breaking the previous van record of 169.6 mph, set by a Claer-tuned T4 VW van), H&R provided an air suspension system that adjusts the ride height among three different levels depending on speed. TH claims the van can hit 62 mph in just 4.5 sec. A customer version would cost somewhere north of $225,000.

At almost 200MPH, that would get us to the South Rim in about, let’s see, 4 hours. Though we’d have to remove all of the camping accessories, move the engine from the front to the back, pay $225K, etc., etc., etc. But the look on drivers’ faces as we sucked their doors off might make it all worthwhile.


Emmett isn’t a mutt, he’s a lurcher!

January 3, 2010

We adopted Emmett from AnimalSave back in October 2005, and he’s been a great member of the Krugler pack. He seems to be a mix of sighthound and Labrador – in other words, he’s a mutt.

But one day, while Jenna and I were speculating about what kind of sighthound would give him his deep chest and curled tail, I did a search on “greyhound labrador mix”, and found out that we’d been wrong all these years.

He’s not a mutt, he’s a lurcher!

What’s a lurcher? Well, according to Wikipedia (source of all truth and goodness) a lurcher is:

a hardy, crossbred sighthound, generally a cross between a sighthound and any other breed…the lurcher was bred in Ireland and Great Britain by the Irish Gypsies and travellers in the 17th century. They were used for poaching rabbits, hares and other small creatures. The name lurcher is derived from the Romani language word lur, which means thief.

There’s even a new group called the North American Lurcher & Longdog Association. It’s a bit hard to tell, but I think Emmett is very excited about the possibility of membership.


Three suggestions for the Mac Finder’s Force Quit command

December 24, 2009

Unfortunately I have to use this several times a week – mostly for Microsoft Word and PowerPoint, but occasionally a few other equally troubled apps.

So in the process, I’ve found a few irritations that could easily be fixed:

  1. Display the list of apps in two groups, with the apps that aren’t responding (usually just one) at the top. When you’re running a lot of apps, the list often winds up being displayed such that the hung app isn’t visible; grouping would guarantee that the offending app is always clearly at the top.
  2. When I force-quit a non-responding app, don’t ask for confirmation. It’s not responding, that’s why I’m explicitly asking you to make it go away.
  3. And in the same vein, when I’ve force-quit a non-responding app, don’t display a scary dialog telling me that an app unexpectedly quit and asking whether I want to report the problem to Apple.

Getting a category-specific RSS feed to a WordPress blog

December 21, 2009

I poked around a bit, and didn’t find any direct info on how to do this, so here are the results of my research.

If you have a WordPress.com-hosted blog (like this one), and you use categories, then the RSS feed for a single category is:

http://<domain>/category/<category name>/feed/

For example, the RSS feed for things I’ve categorized as being about “Nevada City” is http://ken-blog.krugler.org/category/nevada-city/feed/
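As a rough sketch, the pattern above can be turned into a small helper. The class and method names here are made up for illustration, and the slug rule (lowercase, with runs of non-alphanumeric characters collapsed to hyphens) is an assumption that happens to match the “Nevada City” example; WordPress lets you edit category slugs, so check the category page’s own URL if the result doesn’t work.

```java
import java.util.Locale;

final class CategoryFeed {
    // Assumption: the slug is the lowercased category name with runs of
    // non-alphanumeric characters replaced by a single hyphen.
    static String slug(String categoryName) {
        return categoryName.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
    }

    // Fills in the http://<domain>/category/<category name>/feed/ pattern.
    static String feedUrl(String domain, String categoryName) {
        return "http://" + domain + "/category/" + slug(categoryName) + "/feed/";
    }

    public static void main(String[] args) {
        // Prints http://ken-blog.krugler.org/category/nevada-city/feed/
        System.out.println(feedUrl("ken-blog.krugler.org", "Nevada City"));
    }
}
```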


The web is an endless series of edge cases

December 17, 2009

Recently I’d been exchanging emails with Jimmy Lin at CMU. Jimmy has written up some great Hadoop info, and provided some useful classes for working with the ClueWeb09 dataset.

In one of his emails, he said:

However, what I’ve learned is that whenever you’re working with web-scale collections, it exposes bugs in otherwise seemingly solid code.  Sometimes it’s not bugs, but rather obscure corner cases that don’t happen for the most part.  Screwy data is inevitable…

I borrowed his “screwy data is inevitable” line for the talk I gave at December’s ACM data mining SIG event, and added a comment about this being the reason for having to write super-defensive code when implementing anything that touched the web.

Later that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Dataset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the job was failing. I posted details to the Hadoop list, and got one response from Jason Venner about a similar problem he’d run into.

Is it possible that this is occurring in a task that is being killed by the framework. Sometimes there is a little lag, between the time the tracker ‘kills a task’ and the task fully dies, you could be getting into a situation like that where the task is in the process of dying but the last write is still in progress.
I see this situation happen when the task tracker machine is heavily loaded. In once case there was a 15 minute lag between the timestamp in the tracker for killing task XYZ, and the task actually going away.

It took me a while to work this out as I had to merge the tracker and task logs by time to actually see the pattern. The host machines where under very heavy io pressure, and may have been paging also. The code and configuration issues that triggered this have been resolved, so I don’t see it anymore.

This led me down the path of increasing the size of my master instance (I was incorrectly using m1.small with a 50 server cluster), increasing the number of tasktracker.http.threads from 20 to 100, etc. All good things, but nothing that fixed the problem.

However, Jason’s email about merging multiple logs by timestamp value led me to go through all of the logs in more detail. And this led me to the realization that the job before the one where I was seeing the LeaseExpiredException had actually died quite suddenly. I then checked the local logs I wrote out, and I saw that this was right after a statement about parsing an “unusual” file from stanford.edu: http://library.stanford.edu/depts/ssrg/polisci/NGO_files/oledata.mso

The server returns “text/plain” for this file, when in fact it’s a Microsoft Office document. I filter out everything that’s not plain text or HTML, which lets me exclude a bunch of huge Microsoft-specific parse support jars from my Hadoop job jar. When you’re repeatedly pushing jars to S3 via a thin DSL connection, saving 20MB is significant.

But since the server lies like a rug in this case, I pass it on through to the Tika AutoDetectParser. That in turn correctly figures out that it’s a Microsoft Office document, and makes a call to a non-existent method, which throws a NoSuchMethodError (not an Exception!). Since it’s an Error, this flies right on by all of the exception catch blocks, and kills the job.

Looks like I need to get better at following my own advice – a bit of defensive programming would have saved me endless hours of debugging and config-thrashing.
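For what it’s worth, here is a minimal sketch of the kind of defensive wrapper that would have contained this failure. It is not the actual Bixo code: DocParser is a hypothetical stand-in for whatever parser gets called (Tika in my case), and safeParse is an invented name. The relevant fact is that NoSuchMethodError is a LinkageError, which is an Error rather than an Exception, so it needs its own catch clause.

```java
import java.util.Optional;

final class DefensiveParse {
    /** Hypothetical stand-in for a document parser such as Tika's. */
    interface DocParser {
        String parse(byte[] content) throws Exception;
    }

    /** Returns the parsed text, or empty if the parser failed in any expected way. */
    static Optional<String> safeParse(DocParser parser, byte[] content) {
        try {
            return Optional.ofNullable(parser.parse(content));
        } catch (Exception e) {
            // Ordinary parse failures: log and skip this document.
            System.err.println("Parse failed: " + e);
            return Optional.empty();
        } catch (LinkageError e) {
            // NoSuchMethodError, NoClassDefFoundError and friends land here,
            // e.g. when parser support jars were left out of the job jar.
            System.err.println("Parser linkage problem: " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        DocParser broken = content -> {
            throw new NoSuchMethodError("missing parser support method");
        };
        // The Error is contained, so this prints false instead of killing the job.
        System.out.println(safeParse(broken, new byte[0]).isPresent());
    }
}
```

Catching LinkageError instead of Throwable is a judgment call: it covers the missing-jar class of problems while still letting something like OutOfMemoryError propagate.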


Git and unreferenced blobs and Stack Overflow

December 16, 2009

I ran into a problem yesterday, while trying to prune the size of the Bixo repo at GitHub (450MB, ouch).

I deleted the release branch, first on GitHub and then locally. This is what contained a bunch of big binary blobs (release distribution jars). But even after this work, I still had 250MB+ in my local & GitHub repository.

Following some useful steps I found on Stack Overflow, I could isolate the problem down to a few unreferenced blobs. By “unreferenced” I mean these were blobs with SHA1s that could not be located anywhere in the git tree/history by the various scripts I found on Stack Overflow.

I posted a question about this to Stack Overflow, and got some very useful answers, though nothing that directly solved the problem. But it turns out that a fresh clone from GitHub is much smaller, and these dangling blobs are gone. So I think it’s a git bug, where these blobs get left around locally but are correctly cleared from the remote repo.

But I ran into a new problem today with my local Bixo repo, where I couldn’t push changes. I’d get this output from my “git push” command:

Counting objects: 92, done.
Delta compression using 2 threads.
Compressing objects: 100% (53/53), done.
Writing objects: 100% (57/57), 11.50 KiB, done.
Total 57 (delta 28), reused 0 (delta 0)
error: insufficient permission for adding an object to repository database ./objects

fatal: failed to write object
error: unpack-objects exited with error code 128
error: unpack failed: unpack-objects abnormal exit
To git@github.com:bixo/bixo.git
 ! [remote rejected] master -> master (n/a (unpacker error))
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

No solutions came up while searching, but the problem doesn’t exist for a fresh clone, so I’m manually migrating my changes over to the fresh copy, and then I’ll happily delete my apparently messed up older local git repo and move on to more productive uses of my time.

[UPDATE: The problem does actually exist in a fresh clone, but only for the second push. Eventually GitHub support resolved the issue by fixing permissions of some files on their side of the fence. Apparently things got "messed up" during the fork from the original EMI/bixo repo]


Why fetching web pages doesn’t map well to map-reduce

December 12, 2009

While working on Bixo, I spent a fair amount of time trying to figure out how to avoid the multi-threaded complexity and memory-usage issues of the FetcherBuffer class that I wound up writing.

The FetcherBuffer takes care of setting up queues of URLs to be politely fetched, with one queue for each unique <IP address>+<crawl delay> combination. Then a queue of these queues is managed by the FetcherQueueMgr, which works with a thread pool to provide groups of URLs to be fetched by an available thread, when enough time has gone by since the last request to be considered polite.
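To make the queue-of-queues idea concrete, here is a much-simplified, single-threaded sketch. It is not the real FetcherBuffer or FetcherQueueMgr code; the class and method names are invented, the clock is passed in as a parameter to keep the example deterministic, and the crawl delay is assumed to start when a batch is handed out rather than when the fetch completes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class PoliteQueues {
    /** URLs for one <IP address>+<crawl delay> combination. */
    private static final class ServerQueue {
        final long crawlDelayMs;
        final Queue<String> urls = new ArrayDeque<>();
        long nextFetchTimeMs = 0;

        ServerQueue(long crawlDelayMs) {
            this.crawlDelayMs = crawlDelayMs;
        }
    }

    // Insertion-ordered so the example behaves predictably.
    private final Map<String, ServerQueue> queues = new LinkedHashMap<>();

    void add(String ipAddress, long crawlDelayMs, String url) {
        String key = ipAddress + "+" + crawlDelayMs;
        queues.computeIfAbsent(key, k -> new ServerQueue(crawlDelayMs)).urls.add(url);
    }

    /** Returns up to maxUrls from the first queue that is due at nowMs, or an empty list. */
    List<String> poll(long nowMs, int maxUrls) {
        for (ServerQueue q : queues.values()) {
            if (!q.urls.isEmpty() && nowMs >= q.nextFetchTimeMs) {
                List<String> batch = new ArrayList<>();
                while (batch.size() < maxUrls && !q.urls.isEmpty()) {
                    batch.add(q.urls.remove());
                }
                // Simplification: the politeness interval starts at hand-out time.
                q.nextFetchTimeMs = nowMs + q.crawlDelayMs;
                return batch;
            }
        }
        return new ArrayList<>();
    }
}
```

In the real thing, a pool of fetch threads would call something like poll() whenever a thread becomes free, which is where the multi-threading and memory issues come from.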

But this approach means that in the reducer phase of a map-reduce job you have to create these queues, and then wait in the completion phase of the operation until all of them have been processed. Running multiple threads creates complexity and memory issues due to native memory stack space requirements, and having in-memory queues of URLs creates additional memory pressure.

So why can’t we just use Hadoop’s map-reduce support to handle all of this for us?

The key problem is that MR works well when each operation on a key/value pair is independent of any other key/value, and there are no external resource constraints.

But neither of those is true, especially during polite fetching.

For example, let’s say you implemented a mapper that created groups of 10 URLs, where each group was for the same server. You could easily process these groups in a reducer operation. This approach has two major problems, however.

First, you can’t control the interval between when groups for the same server would be processed. So you can wind up hitting a server to fetch URLs from a second group before enough time has expired to be considered polite, or worse yet you could have multiple threads hitting the same server at the same time.

Second, the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of cores (servers * cores/server). So on a 10 server cluster w/dual cores, you’d have 20 threads active. But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware & bandwidth. In Bixo, for example, a typical configuration is 300 threads/reducer.
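The arithmetic behind that comparison, as a tiny sketch. The method names are invented, and the second number is only an upper bound that assumes every reducer slot on the cluster runs a 300-thread fetcher.

```java
final class FetchParallelism {
    // One reducer per core, as in the example above.
    static int reducerSlots(int servers, int coresPerServer) {
        return servers * coresPerServer;
    }

    // Total fetches in flight if each reducer runs its own thread pool.
    static int concurrentFetches(int reducerSlots, int threadsPerReducer) {
        return reducerSlots * threadsPerReducer;
    }

    public static void main(String[] args) {
        int slots = reducerSlots(10, 2);
        // One fetch per reducer: prints 20
        System.out.println(concurrentFetches(slots, 1));
        // 300 threads per reducer: prints 6000
        System.out.println(concurrentFetches(slots, 300));
    }
}
```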

Much of web crawling/mining maps well to a Hadoop map-reduce architecture, but fetching web pages unfortunately is a square peg in a round hole.


Using WordPress for web site but keeping mail separate

November 19, 2009

I use WordPress.com to host a number of web sites, and for simple stuff it’s great.

But I ran into a problem with keeping email separate, so I thought I’d share what I learned.

Here’s the background. I wanted to have http://bixolabs.com and http://www.bixolabs.com both wind up at the web site being hosted by WordPress.com. But I wanted to keep my email separate, versus using the GMail-only approach supported by WordPress.

According to WordPress documentation, you can’t do this. They say:

Changing the name servers will make any previously setup custom DNS records such as A, CNAME, or MX records stop working, and we do not have an option for you to create custom DNS records here. If you already have email configured on your domain, you must either switch to Custom Email with Google Apps or you can use a subdomain instead which doesn’t require changing the name servers.

This meant that I couldn’t just change my name server to WordPress, as they don’t support any customization.

But if I keep my own DNS configuration, then all I can do is use a CNAME record to map a subdomain to WordPress. And you can’t treat “www” as a subdomain.

So my first attempt was to configure my DNS records as follows:

  • www -> [URL redirect] -> http://bixolabs.com
  • @ -> [CNAME] -> bixolabs.wordpress.com
  • @ -> [MX] -> <my hoster’s mail server IP address>

This worked pretty well. www.bixolabs.com got redirected to bixolabs.com, and bixolabs.com mapped to the bixolabs site at WordPress.com.

But the www.bixolabs.com redirect was a temp redirect (HTTP 302 status) not a permanent redirect (HTTP 301 status), so I was losing some SEO “juice” due to how Google and others interpret temp vs. perm redirects.

I fixed this by having my hoster set up their Apache server to do a permanent redirect, and changing the entry for www to point to the Apache server’s IP address.

But there was a bigger, hidden problem. Occasionally people would complain about getting email bounces, when they tried to reply to one of my emails. The reply-to address in my email would be ken@bixolabs.com, but the To: field in their reply would be set to ken@lb.wordpress.com.

Eventually I figured out the problem. It’s technically not valid to have both a CNAME and an MX DNS entry for the same domain (or sub-domain, I assume). If a mail client does a lookup on the reply-to domain, bixolabs.com has the canonical address of “lb.wordpress.com”, since the CNAME entry overrides the MX entry.

The fix for this involved three steps. First, I changed the MX entry in my DNS setup to use “mail”, not “@”. Then I changed my email client reply-to address to use mail.bixolabs.com, not just bixolabs.com. And finally, my hoster had to configure their mail server to recognize mail.bixolabs.com as a valid domain, not just bixolabs.com.
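A quick way to catch this kind of mistake is to check a zone for any name that has a CNAME alongside other record types. The sketch below is only illustrative: the Rec class and cnameConflicts method are invented, records are plain strings rather than anything read from a real DNS server, and “mailhost.example” is a placeholder for my hoster’s mail server.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class DnsZoneCheck {
    /** One DNS record: name ("@" for the bare domain), type, and value. */
    static final class Rec {
        final String name;
        final String type;
        final String value;

        Rec(String name, String type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    /** Returns the names that have a CNAME plus at least one other record type. */
    static Set<String> cnameConflicts(List<Rec> records) {
        Map<String, Set<String>> typesByName = new TreeMap<>();
        for (Rec r : records) {
            typesByName.computeIfAbsent(r.name, k -> new TreeSet<>()).add(r.type);
        }
        Set<String> conflicts = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : typesByName.entrySet()) {
            Set<String> types = e.getValue();
            if (types.contains("CNAME") && types.size() > 1) {
                conflicts.add(e.getKey());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        List<Rec> firstAttempt = List.of(
                new Rec("@", "CNAME", "bixolabs.wordpress.com"),
                new Rec("@", "MX", "mailhost.example"));   // placeholder mail host
        List<Rec> fixed = List.of(
                new Rec("@", "CNAME", "bixolabs.wordpress.com"),
                new Rec("mail", "MX", "mailhost.example")); // placeholder mail host
        System.out.println(cnameConflicts(firstAttempt)); // prints [@]
        System.out.println(cnameConflicts(fixed));        // prints []
    }
}
```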

 


Wikipedia Love

November 16, 2009
Normally we wait until the end of the year to figure out our charitable donations, but I’ve been using Wikipedia so much over the past few days that I felt like I needed to donate today.