What is Big Data, and why you should care

January 8, 2011

That’s the title for my talk this coming Thursday night (January 13th, 2011), at the first “Tech Talk” sponsored by Sierra Commons. Details are at http://sierracommons.org/2011/local-business/2108.

Erika Kosina has done a great job of setting this up, and I’m looking forward to meeting more of the local tech community.

Some of the things I’ll be touching on are:

  • Why big data is the cool new kid on the block, or why I wish I really knew statistics.
  • Crunching big data to filter, cluster, classify and recommend.
  • How big data + machine learning == more effective advertising, for better or worse.
  • The basics of Hadoop, an open source data processing system.

Hope to see you there!



Big Data talk for technical non-programmers

December 22, 2010

I’m going to be giving a talk on big data for the newly formed Nevada County Tech Talk event – a monthly gathering at Sierra Commons.

Unfortunately most of the relevant content I’ve got is for Java programmers interested in using Hadoop. Things I could talk about, based on personal experience:

  • A 600M page web crawl using Bixo.
  • Using LibSVM to predict medications from problems.
  • Using Mahout’s kmeans clustering algorithm on pages referenced from tweets (the unfinished Fokii service).

I’m looking for relevant talks that I can borrow from, but I haven’t found much that’s targeted at the technically minded-but-not-a-programmer crowd.

Comments with pointers to useful talks/presentations would be great!


New power supplies sometimes don’t work with old MacBooks

August 11, 2010

Recently I had to buy a new power supply for my 2008 MacBook, because having three adapters isn’t enough when you forget to bring any of them with you on a business trip.

So I ran into the Apple store in San Francisco and grabbed a new 60 watt adapter – the one with the “L” style MagSafe connector, versus the older “T” style connectors I’ve got on my other three adapters.

Raced back to the client. Plugged it in – and it didn’t work. Spent 20 minutes cleaning my connector, trying different outlets, etc. No luck.

Headed back to the Apple store, and verified the following:

  • My new adapter works with three different MacBooks on display.
  • None of the 60 watt power adapters with “L” style connectors being used for display Macs worked with my MacBook, but all of the 60 watt adapters with older “T” style connectors did work.
  • The 85 watt power adapter at the Genius Bar did work with my MacBook.
  • The new 85 watt power adapter that Mitch @ the Bar set me up with didn’t work with my MacBook.
  • The older 60 watt power adapter Mitch extracted from the store’s repair supply stock did work.

After all of the above, I got in touch with a friend who works as a Genius at the Manhattan store. Turns out she’d just had to deal with a similar issue, and the root of the problem is that the System Management Controller (SMC) needs to be reset for some older MacBooks to work properly with new power adapters.

Apple has information about how to reset the SMC, and on that page it lists one of the reasons why you need to do this as “The battery does not appear to be charging properly”.

I’m hoping Apple updates the info found on both this page and their Troubleshooting MagSafe adapters page, to make it easier for other users to find in the future, before Apple Stores run out of these older “T” style power adapters.

Java case change for canonical paths on Mac OS X

May 27, 2010

I ran into a puzzling test failure recently, which I ultimately tracked down to some very strange directory name handling behavior in Mac OS X (I’m running 10.5).

Previously I’d had a directory in my user directory called “SVN”, and this is where I’d checked out all of my SVN-hosted projects. At some point in the past I changed the name of this directory to be “svn”.

In the terminal, it shows the directory as having the lower-case name, as expected.

But in Java, if I call File.getCanonicalPath() on a file in this directory, the directory name comes back as the old “SVN”. And that in turn caused some tests to make assumptions about the nature of the filesystem, which triggered a cascade of failures.

To fix it, I created a new temp directory, moved everything from inside “svn” over, deleted “svn”, then created a new “svn” and moved everything back. Really strange…
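
For anyone who wants to check whether they’re hitting the same thing, here is a minimal sketch that prints both forms of a path and flags the case where they differ only by case. The class and method names are made up for illustration; the only real API calls are File.getAbsolutePath() and File.getCanonicalPath().

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper class, not part of any project mentioned in this post.
public class CanonicalPathCheck {

    // True when the two paths are identical apart from upper/lower case.
    static boolean differsOnlyByCase(String absolutePath, String canonicalPath) {
        return !absolutePath.equals(canonicalPath)
                && absolutePath.equalsIgnoreCase(canonicalPath);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the working directory; pass a path to check something else.
        File file = new File(args.length > 0 ? args[0] : System.getProperty("user.dir"));
        String absolute = file.getAbsolutePath();
        String canonical = file.getCanonicalPath();

        System.out.println("absolute:  " + absolute);
        System.out.println("canonical: " + canonical);
        if (differsOnlyByCase(absolute, canonical)) {
            System.out.println("Warning: canonical path differs from the absolute path only by case");
        }
    }
}
```

Running it from inside the renamed directory should show whether the canonical path still carries the old name.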

Git failed to push some refs – the multiple branch variant

February 25, 2010

By now I know enough about Git to easily deal with the “error: failed to push some refs” error.

Just pull first, fix any merge problems, and then push.

But this morning I still got the error after doing a pull, then a push to the Bixo project on GitHub.
It turns out I need to read the git output more closely. The error message said:

! [rejected]        master -> master (non-fast forward)
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

But I’m working in a branch (called ‘fetch-flow’). And the “git push” command will try to push all of the branches. But “git pull” only pulls the current branch.

So I had to “git checkout master” to switch to the master branch, then “git pull” to bring that up to date, then “git checkout fetch-flow” to switch back to my branch.

And now my git push works fine. Note that the push did work, in that my ‘fetch-flow’ branch was pushed – it’s just that the auto-push to master failed, and that made me think my entire push had failed.

Getting a category-specific RSS feed to a WordPress blog

December 21, 2009

I poked around a bit, and didn’t find any direct info on how to do this, so here are the results of my research.

If you have a WordPress.com-hosted blog (like this one), and you use categories, then the RSS feed is:

http://<domain>/category/<category name>/feed/

For example, the RSS feed for things I’ve categorized as being about “Nevada City” is https://ken-blog.krugler.org/category/nevada-city/feed/
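
As a rough sketch, the pattern above can be turned into a small helper. The class and method names are hypothetical, and the slug rule (lower-case, spaces replaced by hyphens) is an assumption based only on the “Nevada City” example; WordPress’s real slug generation handles more cases than this.

```java
import java.util.Locale;

// Hypothetical helper; names are for illustration only.
public class CategoryFeedUrl {

    // Builds <domain>/category/<slug>/feed/ from a domain and a category name.
    // Assumption: the slug is the category name lower-cased with whitespace
    // replaced by hyphens, which matches "Nevada City" -> "nevada-city".
    static String categoryFeedUrl(String domain, String categoryName) {
        String slug = categoryName.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", "-");
        return "https://" + domain + "/category/" + slug + "/feed/";
    }

    public static void main(String[] args) {
        System.out.println(categoryFeedUrl("ken-blog.krugler.org", "Nevada City"));
    }
}
```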

The web is an endless series of edge cases

December 17, 2009

Recently I’d been exchanging emails with Jimmy Lin at the University of Maryland. Jimmy has written up some great Hadoop info, and provided some useful classes for working with the ClueWeb09 dataset.

In one of his emails, he said:

However, what I’ve learned is that whenever you’re working with web-scale collections, it exposes bugs in otherwise seemingly solid code.  Sometimes it’s not bugs, but rather obscure corner cases that don’t happen for the most part.  Screwy data is inevitable…

I borrowed his “screwy data is inevitable” line for the talk I gave at December’s ACM data mining SIG event, and added a comment about this being the reason for having to write super-defensive code when implementing anything that touched the web.

Later that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Dataset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the job was failing. I posted details to the Hadoop list, and got one response from Jason Venner about a similar problem he’d run into.

Is it possible that this is occurring in a task that is being killed by the framework? Sometimes there is a little lag between the time the tracker ‘kills a task’ and the task fully dies; you could be getting into a situation like that, where the task is in the process of dying but the last write is still in progress.
I see this situation happen when the task tracker machine is heavily loaded. In one case there was a 15 minute lag between the timestamp in the tracker for killing task XYZ, and the task actually going away.

It took me a while to work this out as I had to merge the tracker and task logs by time to actually see the pattern. The host machines were under very heavy I/O pressure, and may have been paging also. The code and configuration issues that triggered this have been resolved, so I don’t see it anymore.

This led me down the path of increasing the size of my master instance (I was incorrectly using m1.small with a 50-server cluster), increasing the number of tasktracker.http.threads from 20 to 100, etc. All good things, but nothing that fixed the problem.

However, Jason’s email about merging multiple logs by timestamp led me to go through all of the logs in more detail. And this led me to the realization that the job before the one where I was seeing a LeaseExpiredException had actually died quite suddenly. I then checked the local logs I wrote out, and I saw that this was right after a statement about parsing an “unusual” file from stanford.edu: http://library.stanford.edu/depts/ssrg/polisci/NGO_files/oledata.mso
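
The log merging Jason describes can be sketched in a few lines. This is only an illustration, not what either of us actually ran: it assumes every line starts with a 23-character “yyyy-MM-dd HH:mm:ss,SSS” timestamp (continuation lines such as stack traces would need extra handling), and the class name, method names and sample log lines are all made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; assumes each log line begins with a fixed-width timestamp.
public class LogMerge {

    // Assumed length of a leading "yyyy-MM-dd HH:mm:ss,SSS" timestamp.
    static final int TIMESTAMP_LENGTH = 23;

    // Inserts a source label right after the timestamp so the origin stays visible.
    static String label(String line, String source) {
        return line.substring(0, TIMESTAMP_LENGTH) + " [" + source + "]"
                + line.substring(TIMESTAMP_LENGTH);
    }

    static List<String> mergeByTimestamp(List<String> trackerLines, List<String> taskLines) {
        List<String> merged = new ArrayList<>();
        for (String line : trackerLines) merged.add(label(line, "tracker"));
        for (String line : taskLines) merged.add(label(line, "task"));
        // Zero-padded timestamps in this format sort correctly as plain strings.
        // List.sort is stable, so lines with equal timestamps keep their order.
        merged.sort((a, b) -> a.substring(0, TIMESTAMP_LENGTH)
                .compareTo(b.substring(0, TIMESTAMP_LENGTH)));
        return merged;
    }

    public static void main(String[] args) {
        // Made-up log lines, for illustration only.
        List<String> tracker = Arrays.asList(
                "2009-12-17 10:00:00,000 INFO killing task XYZ",
                "2009-12-17 10:15:00,000 INFO task XYZ gone");
        List<String> task = Arrays.asList(
                "2009-12-17 10:05:00,000 WARN last write still in progress");
        for (String line : mergeByTimestamp(tracker, task)) {
            System.out.println(line);
        }
    }
}
```

Seeing the two sources interleaved on one timeline is what makes a sudden gap or an abrupt end in one of them stand out.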

The server returns “text/plain” for this file, when in fact it’s a Microsoft Office document. I filter out everything that’s not plain text or HTML, which lets me exclude a bunch of huge Microsoft-specific parse support jars from my Hadoop job jar. When you’re repeatedly pushing jars to S3 via a thin DSL connection, saving 20MB is significant.

But since the server lies like a rug in this case, I pass it on through to the Tika AutoDetectParser. And that in turn correctly figures out that it’s a Microsoft Office document, and makes a call to a nonexistent method. Which throws a NoSuchMethodError (not an Exception!). Since it’s an Error, this flies right on by all of the exception catch blocks, and kills the job.

Looks like I need to get better at following my own advice – a bit of defensive programming would have saved me endless hours of debugging and config-thrashing.
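
To make the Error-versus-Exception point concrete, here is a minimal sketch. It is not the actual crawler code: riskyParse is a stand-in that simply throws a NoSuchMethodError, and every class and method name is invented for illustration. The one fact it relies on is that NoSuchMethodError is a subclass of LinkageError, which is an Error rather than an Exception.

```java
// Hypothetical demo class; riskyParse simulates a parser hitting a missing method.
public class DefensiveParse {

    // Stand-in for a parser call that fails because a method is missing at runtime.
    static String riskyParse(String content) {
        throw new NoSuchMethodError("simulated missing method");
    }

    // Catches only Exception, so the Error escapes to the caller.
    static String parseCatchingException(String content) {
        try {
            return riskyParse(content);
        } catch (Exception e) {
            return "skipped: " + e.getClass().getSimpleName();
        }
    }

    // Also catches LinkageError (parent of NoSuchMethodError),
    // so one bad document is skipped instead of killing the whole run.
    static String parseDefensively(String content) {
        try {
            return riskyParse(content);
        } catch (Exception | LinkageError e) {
            return "skipped: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDefensively("some document"));
        try {
            parseCatchingException("some document");
        } catch (NoSuchMethodError e) {
            System.out.println("escaped the Exception handler: " + e.getMessage());
        }
    }
}
```

Catching LinkageError instead of every Throwable is a judgment call: it covers missing classes and methods from trimmed-down jars, while still letting things like OutOfMemoryError propagate.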