What is Big Data, and why you could care

January 8, 2011

That's the title for my talk this coming Thursday night (January 13th, 2011), at the first "Tech Talk" sponsored by Sierra Commons. Details are at http://sierracommons.org/2011/local-business/2108.

Erika Kosina has done a great job of setting this up, and I’m looking forward to meeting more of the local tech community.

Some of the things I’ll be touching on are:

  • Why big data is the cool new kid on the block, or why I wish I really knew statistics.
  • Crunching big data to filter, cluster, classify and recommend.
  • How big data + machine learning == more effective advertising, for better or worse.
  • The basics of Hadoop, an open source data processing system.

Hope to see you there!

 


Big Data talk for technical non-programmers

December 22, 2010

I'm going to be giving a talk on big data for the newly formed Nevada County Tech Talk event – a monthly gathering at Sierra Commons.

Unfortunately most of the relevant content I’ve got is for Java programmers interested in using Hadoop. Things I could talk about, based on personal experience:

  • A 600M page web crawl using Bixo.
  • Using LibSVM to predict medications from problems.
  • Using Mahout’s kmeans clustering algorithm on pages referenced from tweets (the unfinished Fokii service).

I'm looking for relevant talks that I can borrow from, but I haven't found much that's targeted at the technically-minded-but-not-a-programmer crowd.

Comments with pointers to useful talks/presentations would be great!

 


New power supplies sometimes don’t work with old MacBooks

August 11, 2010

Recently I had to buy a new power supply for my 2008 MacBook. Because having three adapters isn't enough when you forget to bring any of them with you on a business trip.

So I ran into the Apple store in San Francisco and grabbed a new 60 watt adapter – the one with the “L” style MagSafe connector, versus the older “T” style connectors I’ve got on my other three adapters.

Raced back to the client. Plugged it in – and it didn’t work. Spent 20 minutes cleaning my connector, trying different outlets, etc. No luck.

Headed back to the Apple store, and verified the following:

  • My new adapter works with three different MacBooks on display.
  • None of the 60 watt power adapters with “L” style connectors being used for display Macs worked with my MacBook, but all of the 60 watt adapters with older “T” style connectors did work.
  • The 85 watt power adapter at the Genius Bar did work with my MacBook.
  • The new 85 watt power adapter that Mitch @ the Bar set me up with didn’t work with my MacBook.
  • The older 60 watt power adapter Mitch extracted from the store’s repair supply stock did work.

After all of the above, I got in touch with a friend who works as a Genius at the Manhattan store. Turns out she’d just had to deal with a similar issue, and the root of the problem is that the System Management Controller (SMC) needs to be reset for some older MacBooks to work properly with new power adapters.

Apple has information about how to reset the SMC, and on that page it lists one of the reasons why you need to do this as “The battery does not appear to be charging properly”.

I'm hoping Apple updates the info found on both this page and their Troubleshooting MagSafe adapters page, to make it easier for other users to find in the future, before Apple Stores run out of these older "T" style power adapters.


Java case change for canonical paths on Mac OS X

May 27, 2010

I ran into a puzzling test failure recently, which I ultimately tracked down to some very strange directory name handling behavior in Mac OS X (I’m running 10.5).

Previously I’d had a directory in my user directory called “SVN”, and this is where I’d checked out all of my SVN-hosted projects. At some point in the past I changed the name of this directory to be “svn”.

In the terminal, it shows the directory as having the lower-case name, as expected.

But in Java, if I call File.getCanonicalPath() on a file in this directory, the directory name comes back as the old "SVN". And that in turn broke some tests that made assumptions about the nature of the filesystem, triggering a cascade of failures.
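
Here's a minimal sketch of how the mismatch shows up (the path below is just an example, not my actual project layout):

import java.io.File;
import java.io.IOException;

public class CanonicalCase {
    public static void main(String[] args) throws IOException {
        File f = new File(System.getProperty("user.home") + "/svn/some-project");
        System.out.println("Absolute:  " + f.getAbsolutePath());    // prints .../svn/some-project
        System.out.println("Canonical: " + f.getCanonicalPath());   // comes back as .../SVN/some-project
    }
}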

To fix it, I created a new temp directory, moved everything from inside “svn” over, deleted “svn”, then created a new “svn” and moved everything back. Really strange…


Git failed to push some refs – the multiple branch variant

February 25, 2010

By now I knew enough about Git to easily deal with the “error: failed to push some refs” error.

Just pull first, fix any merge problems, and then push.

But this morning I still got the error after doing a pull, then a push to the Bixo project on GitHub.

It turns out I needed to read the git output more closely. The error message said:

! [rejected]        master -> master (non-fast forward)
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

But I’m working in a branch (called ‘fetch-flow’). And the “git push” command will try to push all of the branches. But “git pull” only pulls the current branch.

So I had to “git checkout master” to switch to the master branch, then “git pull” to bring that up to date, then “git checkout fetch-flow” to switch back to my branch.
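
In command form:

git checkout master
git pull
git checkout fetch-flow
git push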

And now my git push works fine. Note that the push did work, in that my ‘fetch-flow’ branch was pushed – it’s just that the auto-push to master failed, and that made me think my entire push had failed.


Getting a category-specific RSS feed for a WordPress blog

December 21, 2009

I poked around a bit and didn't find any direct info on how to do this, so here are the results of my research.

If you have a WordPress.com-hosted blog (like this one), and you use categories, then the RSS feed is:

http://<domain>/category/<category name>/feed/

For example, the RSS feed for things I’ve categorized as being about “Nevada City” is https://ken-blog.krugler.org/category/nevada-city/feed/


The web is an endless series of edge cases

December 17, 2009

Recently I’d been exchanging emails with Jimmy Lin at CMU. Jimmy has written up some great Hadoop info, and provided some useful classes for working with the ClueWeb09 dataset.

In one of his emails, he said:

However, what I’ve learned is that whenever you’re working with web-scale collections, it exposes bugs in otherwise seemingly solid code.  Sometimes it’s not bugs, but rather obscure corner cases that don’t happen for the most part.  Screwy data is inevitable…

I borrowed his “screwy data is inevitable” line for the talk I gave at December’s ACM data mining SIG event, and added a comment about this being the reason for having to write super-defensive code when implementing anything that touched the web.

Later that same week, I was debugging a weird problem with my Elastic MapReduce web crawling job for the Public Terabyte Dataset project. At some point during one of the steps, I was getting LeaseExpiredExceptions in the logs, and the job was failing. I posted details to the Hadoop list, and got one response from Jason Venner about a similar problem he'd run into.

Is it possible that this is occurring in a task that is being killed by the framework. Sometimes there is a little lag, between the time the tracker ‘kills a task’ and the task fully dies, you could be getting into a situation like that where the task is in the process of dying but the last write is still in progress.
I see this situation happen when the task tracker machine is heavily loaded. In once case there was a 15 minute lag between the timestamp in the tracker for killing task XYZ, and the task actually going away.

It took me a while to work this out as I had to merge the tracker and task logs by time to actually see the pattern. The host machines where under very heavy io pressure, and may have been paging also. The code and configuration issues that triggered this have been resolved, so I don’t see it anymore.

This led me down the path of increasing the size of my master instance (I was incorrectly using m1.small with a 50 server cluster), increasing the number of tasktracker.http.threads from 20 to 100, etc. All good things, but nothing that fixed the problem.
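
For reference, the thread count change is just a property tweak in the cluster config – something like this (a sketch, using the old-style property name from that era of Hadoop):

<property>
  <name>tasktracker.http.threads</name>
  <value>100</value>
</property>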

However Jason's email about merging multiple logs by timestamp value led me to go through all of the logs in more detail. And this led me to the realization that the job just before the one where I was seeing the LeaseExpiredException had actually died quite suddenly. I then checked the local logs I wrote out, and saw that this was right after a statement about parsing an "unusual" file from stanford.edu: http://library.stanford.edu/depts/ssrg/polisci/NGO_files/oledata.mso

The server returns “text/plain” for this file, when in fact it’s a Microsoft Office document. I filter out everything that’s not plain text or HTML, which lets me exclude a bunch of huge Microsoft-specific parse support jars from my Hadoop job jar. When you’re repeatedly pushing jars to S3 via a thin DSL connection, saving 20MB is significant.

But since the server lies like a rug in this case, I pass it on through to the Tika AutoDetectParser. And that in turn correctly figures out that it's a Microsoft Office document, and makes a call to a non-existent method. Which throws a NoSuchMethodError (not an Exception!). Since it's an Error, this flies right on by all of the exception catch blocks, and kills the job.

Looks like I need to get better at following my own advice – a bit of defensive programming would have saved me endless hours of debugging and config-thrashing.
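
For example, something like this (a sketch, not the actual Bixo code, and assuming a reasonably recent Tika API) would have kept that one document from taking down the whole job:

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class SafeParse {
    // Returns the extracted text, or null if anything at all goes wrong.
    public static String parseOrNull(InputStream content) {
        try {
            BodyContentHandler handler = new BodyContentHandler();
            new AutoDetectParser().parse(content, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (Throwable t) {
            // Catching Throwable (not just Exception) means an Error like
            // NoSuchMethodError from a missing parser jar lands here too.
            return null;
        }
    }
}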


Git and unreferenced blobs and Stack Overflow

December 16, 2009

I ran into a problem yesterday, while trying to shrink the Bixo repo at GitHub (450MB, ouch).

I deleted the release branch, first on GitHub and then locally. This is what contained a bunch of big binary blobs (release distribution jars). But even after this work, I still had 250MB+ in my local & GitHub repository.

Following some useful steps I found on Stack Overflow, I was able to isolate the problem down to a few unreferenced blobs. By "unreferenced" I mean these were blobs with SHA1s that couldn't be located anywhere in the git tree/history by the various scripts I found on Stack Overflow.
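
I don't have the exact scripts handy, but the general shape of the digging looks something like this (a sketch of the usual commands, not a transcript of what I ran):

# how much data is in packs vs. loose objects
git count-objects -v

# list unreachable (dangling) objects
git fsck --full --unreachable

# the biggest objects in the pack files, by size
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail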

I posted a question about this to Stack Overflow, and got some very useful answers, though nothing that directly solved the problem. But it turns out that a fresh clone from GitHub is much smaller, and these dangling blobs are gone. So I think it’s a git bug, where these blobs get left around locally but are correctly cleared from the remote repo.

But I ran into a new problem today with my local Bixo repo, where I couldn’t push changes. I’d get this output from my “git push” command:

Counting objects: 92, done.
Delta compression using 2 threads.
Compressing objects: 100% (53/53), done.
Writing objects: 100% (57/57), 11.50 KiB, done.
Total 57 (delta 28), reused 0 (delta 0)
error: insufficient permission for adding an object to repository database ./objects

fatal: failed to write object
error: unpack-objects exited with error code 128
error: unpack failed: unpack-objects abnormal exit
To git@github.com:bixo/bixo.git
 ! [remote rejected] master -> master (n/a (unpacker error))
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

No solutions came up while searching, but the problem doesn’t exist for a fresh clone, so I’m manually migrating my changes over to the fresh copy, and then I’ll happily delete my apparently messed up older local git repo and move on to more productive uses of my time.

[UPDATE: The problem does actually exist in a fresh clone, but only for the second push. Eventually GitHub support resolved the issue by fixing permissions of some files on their side of the fence. Apparently things got “messed up” during the fork from the original EMI/bixo repo]


Fixing Firefox default monitor

October 26, 2009

I’m running Firefox 3.0.14 on Mac OS X 10.5.

I’ve got a MacBook laptop and a 24″ LCD display as my normal configuration, though sometimes I’m just using the laptop.

Whenever I open a new browser window, it defaults to the laptop display, not the big LCD, even though the LCD is my main screen.

I searched the forums, and didn’t find any good solution, so here’s what worked for me:

  1. Quit Firefox
  2. Locate the localstore.rdf file in your Firefox profile directory. This will be in the ~/Library/Application Support/Firefox/Profiles/<random string>.default/ directory.
  3. Open it with your favorite text editor.
  4. Find the RDF section with the description set to “chrome://browser/content/browser.xul#main-window”
  5. Set the screenX and screenY values to 0.
  6. Save the file.
  7. Restart Firefox

In my case, for example, the prior contents of this entry were:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="-1273"
    screenY="401"
    width="1276"
    sizemode="maximized" />

By setting screenX="0" and screenY="0", I was able to fix my problem.
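
So the edited entry ended up looking like this:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="0"
    screenY="0"
    width="1276"
    sizemode="maximized" />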


Performance problems with vertical/focused web crawling

May 19, 2009

Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:

Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.

There’s also a Jira issue (NUTCH-721) filed on the problem.

But in my experience using Nutch for vertical/focused crawls, very slow fetch performance at the end of a crawl is a fundamental problem caused by not having enough unique domains. Once the number of unique domains drops significantly (because you've fetched all of the URLs for most of the domains), fetch performance always drops rapidly – at least if your crawler properly obeys robots.txt and the usual rules for polite crawling.

Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was just doing using Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11 server cluster (1 master, 10 slaves) using the small EC2 instance. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4000 active fetch threads, which is way more than I needed, but I was also testing memory usage (primarily kernel memory) of threads, so I’d cranked this way up.

I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “paid level” ontology as described in the IRLbot paper. So http://www.ibm.com, blogs.us.ibm.com, and ibm.com are all the same domain.
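
As a very simplified sketch of that kind of domain classification (not Bixo's actual code – the real thing needs a public suffix list so that something.co.uk doesn't collapse to co.uk):

import java.net.URL;

public class PaidLevelDomainSketch {
    // Naive version: keep just the last two labels of the hostname.
    public static String fromUrl(String url) throws Exception {
        String host = new URL(url).getHost();
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) throws Exception {
        // All three print "ibm.com"
        System.out.println(fromUrl("http://www.ibm.com"));
        System.out.println(fromUrl("http://blogs.us.ibm.com/somepage"));
        System.out.println(fromUrl("http://ibm.com"));
    }
}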

Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…

[Graph: Fetch performance over time]

The key things to note from this graph are:

  • The 41K unique domains were down to 1,700 after an hour, and then slowly continued to drop. This directly impacts the number of fetches that can politely execute at the same time. In fact there were only 240 parallel fetches (== 240 domains) after an hour, and 64 after three hours.
  • Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
  • And so it does, going from almost 9K/second (scaled to 10ths of second in the graph) after one hour down to 7K/second after four hours.

I think this represents a typical vertical/focused crawl, where a graph of the number of URLs/domain would show a very strong exponential decay. So once you've fetched the single URL you've got for a lot of different domains, you're left with lots of URLs for a much smaller number of domains. And your performance will begin to stink.

The solution I'm using in Bixo is to specify the target fetch duration. From this, I can estimate the number of URLs per domain I might be able to get, and so I pre-prune the URLs put into each domain's fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is one piece in the pipeline that needs to play well with the rest (i.e. not stall the entire process).
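
Here's the back-of-the-envelope version of that pre-pruning (the numbers are made up for illustration – in practice they come from the crawl configuration):

public class FetchQueuePruning {
    public static void main(String[] args) {
        long targetDurationSecs = 4 * 60 * 60;   // say we want the fetch phase done in four hours
        long crawlDelaySecs = 30;                // polite delay between requests to any one domain

        // Upper bound on how many URLs a single domain can contribute in that window,
        // so anything beyond this gets pruned from that domain's fetch queue up front.
        long maxUrlsPerDomain = targetDurationSecs / crawlDelaySecs;
        System.out.println("Max URLs per domain: " + maxUrlsPerDomain);   // 480
    }
}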

Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.