The web is an endless series of edge cases

Recently I’d been exchanging emails with Jimmy Lin at CMU. Jimmy has written up some great Hadoop info, and provided some useful classes for working with the ClueWeb09 dataset.

In one of his emails, he said:

However, what I’ve learned is that whenever you’re working with web-scale collections, it exposes bugs in otherwise seemingly solid code.  Sometimes it’s not bugs, but rather obscure corner cases that don’t happen for the most part.  Screwy data is inevitable…

I borrowed his “screwy data is inevitable” line for the talk I gave at December’s ACM data mining SIG event, and added a comment about this being the reason for having to write super-defensive code when implementing anything that touched the web.

Later that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Datset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the job was failing. I posted details to the Hadoop list, and got one response from Jason Venner about a similar problem he’d run into.

Is it possible that this is occurring in a task that is being killed by the framework. Sometimes there is a little lag, between the time the tracker ‘kills a task’ and the task fully dies, you could be getting into a situation like that where the task is in the process of dying but the last write is still in progress.
I see this situation happen when the task tracker machine is heavily loaded. In once case there was a 15 minute lag between the timestamp in the tracker for killing task XYZ, and the task actually going away.

It took me a while to work this out as I had to merge the tracker and task logs by time to actually see the pattern. The host machines where under very heavy io pressure, and may have been paging also. The code and configuration issues that triggered this have been resolved, so I don’t see it anymore.

This led me down the path of increasing the size of my master instance (I was incorrectly using m1.small with a 50 server cluster), increasing the number of tasktracker.http.threads from 20 to 100, etc. All good things, but nothing that fixed the problem.

However Jason’s email about merging multiple logs by timestamp value led me to go through all of the logs in more detail. And this led me to the realization that the job previous to where I was seeing a LeaseExpiredException had actually died quite suddenly. I then checked the local logs I wrote out, and I saw that this was right after a statement about parsing an “unusual” file from

The server returns “text/plain” for this file, when in fact it’s a Microsoft Office document. I filter out everything that’s not plain text or HTML, which lets me exclude a bunch of huge Microsoft-specific parse support jars from my Hadoop job jar. When you’re repeatedly pushing jars to S3 via a thin DSL connection, saving 20MB is significant.

But since the server lies like a rug in this case, I pass it on through to the Tika AutoDectectParser. And that in turn correctly figures out that it’s a Microsoft Office document, and makes a call to a non-existing method. Which throws a NoSuchMethodError (not an Exception!). Since it’s an Error, this flies right on by all of the exception catch blocks, and kills the job.

Looks like I need to get better at following my own advice – a bit of defensive programming would have saved me endless hours of debugging and config-thrashing.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: