Bixolabs goes public

November 2, 2009

I’ve been working on an elastic web mining platform for a few months now, and it was finally time to go public with at least the current state of the union.

I gave a talk at the ACM Data Mining Unconference on Sunday, where I also announced the Public Terabyte Dataset project, so the timing was perfect.

If you want to know what’s been keeping me busy, and looks to be part of my future, check out http://bixolabs.com.

 


Fixing Firefox default monitor

October 26, 2009

I’m running Firefox 3.0.14 on Mac OS X 10.5.

I’ve got a MacBook laptop and a 24″ LCD display as my normal configuration, though sometimes I’m just using the laptop.

Whenever I open a new browser window, it defaults to the laptop display, not the big LCD, even though that’s my main screen.

I searched the forums, and didn’t find any good solution, so here’s what worked for me:

  1. Quit Firefox
  2. Locate the localstore.rdf file in your Firefox profile directory. This will be in the ~/Library/Application Support/Firefox/Profiles/<random string>.default/ directory.
  3. Open it with your favorite text editor.
  4. Find the RDF section with the description set to “chrome://browser/content/browser.xul#main-window”
  5. Set the screenX and screenY values to 0.
  6. Save the file.
  7. Restart Firefox

In my case, for example, the prior contents of this file were:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="-1273"
    screenY="401"
    width="1276"
    sizemode="maximized" />

By setting screenX="0" and screenY="0", I was able to fix my problem.
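For anyone who would rather script the edit than do it by hand, here is a minimal sketch of the same change. It assumes the entry looks like the one shown above; the function name and the sample string are only for illustration, and it works on text you have already read from localstore.rdf.

```python
import re

# The "about" value of the main browser window entry in localstore.rdf.
MAIN_WINDOW = "chrome://browser/content/browser.xul#main-window"

def reset_window_position(rdf_text):
    """Return rdf_text with screenX and screenY set to 0 for the main window."""
    # Match the opening tag of the main-window description, attributes included.
    tag = re.compile(
        r'<RDF:Description\s+RDF:about="' + re.escape(MAIN_WINDOW) + r'"[^>]*>'
    )

    def fix(match):
        # Rewrite only the two position attributes; negative values match too.
        return re.sub(r'\b(screenX|screenY)="[^"]*"', r'\1="0"', match.group(0))

    return tag.sub(fix, rdf_text)

# Sample input copied from the entry shown above.
sample = '''<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="-1273"
    screenY="401"
    width="1276"
    sizemode="maximized" />'''

print(reset_window_position(sample))
```

As with the manual steps, quit Firefox first, and keep a backup copy of the file before writing the result back.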


The WordPress Business Model

September 11, 2009

I think I finally understand how hosted WordPress makes money :)

I recently set up a web site for my dad’s consulting business, at KruglerEngineeringGroup.com. I used the WordPress hosted service, and a flexible, business-oriented theme called Vigilance.

But I needed to tweak the colors to get a solid background, with white-on-blue text. It was pretty easy (using Firebug) to figure out the CSS changes required, and I could edit these in the WordPress Custom CSS form, and I got the look I wanted – so the hook was set. Then I just needed to pay for the $14.97/year “upgrade” to be able to save and use the custom CSS.

Which I gladly did, since it would be way more expensive for me in time and hassle to try to set this up in my own WordPress environment.

Step 2 was connecting his existing KruglerEngineeringGroup.com domain to the WordPress site. A few clicks on the WordPress.com site, another modest yearly payment of $9.97 (where do they get these amounts?), and we were almost all set. The one minor difficulty was in handling the “www” subdomain. WordPress says that if you want this to work, you need to change the domain name servers to use their name servers. But the current domain needs to use a specific email server (MX record).

So the solution was to create two DNS entries in the current name server config. One was the standard WordPress entry for subdomains, where you create a CNAME record that maps “@” to kruglerengineeringgroup.wordpress.com. The second entry mapped “www” as a URL redirect to http://kruglerengineeringgroup.com. Once that propagated, everything worked as planned. A few hours of my time, and $24.94/year to WordPress.


Local mines near Nevada City

August 25, 2009

While searching (in vain) for a cool new domain name, I stumbled upon the CaliforniaMaps.org web site. As you can see from the snapshot below, there are lots of local places to go looking for empty mine shafts to fall down:

Nevada County Mines

Check out the interactive version here.


A Man and his EuroVan Camper

August 8, 2009

After 10 years of on-and-off discussion, we finally took the plunge and bought a 1997 EuroVan Camper – or 97EVC for members of the club.

EuroVan Camper at Westport

Being able to pull in, pop the top, and kick back was huge.

We’re still working out the kinks in our travel setup and procedures, though. As a fellow EVCer said, it’s like living in a sailboat: you have to plan things a few moves in advance, so you don’t wind up with the bed extended and blocking the cabinet that contains your toothbrush.

And thank goodness for the EVC Yahoo group – without their help, I would have been totally stuck.

We’ve got a typical list of things to fix, buy, and figure out before the next big trip:

  • The dreaded Norcold refrigerator stopped working on propane. And the “burner on” light fell out (again).
  • Cruise control stopped working.
  • There’s a small coolant leak.
  • The driver’s side windshield wiper fluid doesn’t squirt.
  • There’s a new crack in the windshield.
  • The headlight low beams are way too low.
  • The rear (hatch) door sometimes doesn’t unlock.

But in spite of the problems, we had a great time. And we would have never spent two wonderful days camping in the redwoods at Humboldt State Park, or seen this amazing memorial to the town of Pepperwood, which was wiped out in the 1964 Eel River flood.

Memorial for town of Pepperwood

The caption says

To Pepperwood

And It’s Loved Ones

Gone but not forgotten

Presented by

Fortuna Chamber of Commerce

Inquiries to the Fortuna CoC for background and grammar checks have gone unanswered.


Why I buy from Patagonia

May 27, 2009

Yes, it costs more for Patagonia. But the way they treat me as a customer makes me happy to pay a premium…as my latest experience shows.

I had a pair of Patagonia Gore-Tex pants from way-back-when. They worked fine, though my duct tape patch job ruined the clean lines – I’d accidentally stuck my ice axe through the pants and into my left leg, instead of the glacier, during a glissade off Rainier.

And then this past snowboarding season some seam sealing tape started coming off, so things began to get a bit wet at times. I sent the pants to Patagonia, with a note explaining that I’d also be happy to pay for a real repair job of my ice axe mishap.

Yesterday I got a Patagonia gift card in the mail, for $238.44. No idea how they calculated that amount, but I’m looking forward to buying a replacement pair of pants. And they’ve reaffirmed my belief that paying for quality gear winds up being cheaper in the end.


Performance problems with vertical/focused web crawling

May 19, 2009

Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:

Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.

There’s also a Jira issue (NUTCH-721) filed on the problem.

But in my experience using Nutch to do vertical/focused crawls, very slow fetch performance at the end of a crawl is a fundamental problem caused by not having enough unique domains. A polite crawler, one that properly obeys robots.txt and the default rules for polite crawling, limits how often it hits any single domain. So once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), the fetch performance always drops rapidly.

Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was just doing using Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11 server cluster (1 master, 10 slaves) using the small EC2 instance. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4000 active fetch threads, which is way more than I needed, but I was also testing memory usage (primarily kernel memory) of threads, so I’d cranked this way up.

I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “paid level” ontology as described in the IRLbot paper. So www.ibm.com, blogs.us.ibm.com, and ibm.com are all the same domain.
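To make the grouping concrete, here is a rough sketch of collapsing hostnames to a paid-level domain. It is not Bixo's code, and it doesn't reproduce the details of the IRLbot paper. It assumes a tiny hand-written set of multi-label suffixes (MULTI_LABEL_SUFFIXES, made up for this example), where a real implementation would need a complete public suffix list.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: a tiny illustrative subset of suffixes that span two labels.
# A real crawler would load a complete public suffix list instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def paid_level_domain(url):
    """Collapse a URL's hostname to its (approximate) paid-level domain."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # For hosts like news.bbc.co.uk, keep three labels instead of two.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def group_by_domain(urls):
    """Map each paid-level domain to the list of URLs that belong to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[paid_level_domain(url)].append(url)
    return dict(groups)

urls = [
    "http://www.ibm.com/",
    "http://blogs.us.ibm.com/post",
    "http://ibm.com/about",
]
print(group_by_domain(urls))
```

With this grouping, all three sample URLs land in a single "ibm.com" queue, which is what determines how many fetches can politely run in parallel.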

Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…

Fetch Performance

The key things to note from this graph are:

  • The 41K unique domains were down to 1700 after an hour, and then slowly continued to drop. This directly limits the number of fetches that can politely execute in parallel. In fact there were only 240 parallel fetches (== 240 domains) after an hour, and 64 after three hours.
  • Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
  • And so it does, going from almost 9K/second (scaled to 10ths of a second in the graph) after one hour down to 7K/second after four hours.

I think this represents a typical vertical/focused crawl, where a graph of the number of URLs/domain would show a very strong exponential decay. So once you’ve fetched the single URLs from a lot of different domains, you’re left with lots of URLs for a much smaller number of domains. And your performance will begin to stink.

The solution I’m using in Bixo is to specify the target fetch duration. From this, I can estimate the number of URLs per domain I might be able to get, and so I pre-prune the URLs put into each domain’s fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is a piece in the data processing pipeline that needs to play well (i.e. not stall the entire process).
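Here is a simplified sketch of that pre-pruning idea, not the actual Bixo code. It assumes each domain is fetched one URL at a time with a fixed delay between requests; the 30-second delay and the function names are made up for the example.

```python
def max_urls_per_domain(target_seconds, crawl_delay_seconds):
    """How many URLs one domain can politely serve within the target duration."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl delay must be positive")
    return int(target_seconds // crawl_delay_seconds)

def prune_queues(queues, target_seconds, crawl_delay_seconds):
    """Trim each domain's URL list so no queue outlasts the target duration."""
    limit = max_urls_per_domain(target_seconds, crawl_delay_seconds)
    return {domain: urls[:limit] for domain, urls in queues.items()}

def estimated_duration(queues, crawl_delay_seconds):
    """The longest queue sets the crawl duration, since domains run in parallel."""
    longest = max((len(urls) for urls in queues.values()), default=0)
    return longest * crawl_delay_seconds

# Hypothetical queues: one domain with many URLs, one with only a few.
queues = {"a.com": ["url"] * 500, "b.com": ["url"] * 3}
pruned = prune_queues(queues, target_seconds=3600, crawl_delay_seconds=30)
print({domain: len(urls) for domain, urls in pruned.items()})
print(estimated_duration(pruned, crawl_delay_seconds=30))
```

The trade-off is that URLs past the limit don't get fetched in this pass, so they have to wait for a later crawl. In exchange, the fetch phase ends on schedule instead of trailing off behind a handful of large domains.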

Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.


Google Earth overlay for California Thirteeners peak list

May 11, 2009

My friend Schmed has compiled the definitive list of California peaks that are at least 13,000ft in height. This labor of love has consumed countless hours, and now he’s adding to the effort by creating a database-driven GUI.

While incredibly detailed and accurate, his data set wasn’t all that useful to me when thinking about climbing trips. I’d still wind up hunched over my old “Guide to the John Muir Wilderness and Sequoia-Kings Canyon Wilderness” maps, with R. J. Secor’s “The High Sierra” book in hand, trying to figure out possible routes to interesting areas.

So I wrote a program to convert his data into a Google Earth-compatible KML file, which I could then use to visualize the peak list in glorious 3D. The resulting file has proven very useful, so I thought I’d share it via this blog post – and provide a bit of commentary regarding the program/process at the same time.

Google Earth peaks

First, notes about the file:

  • You can download it here. Then just open it from Google Earth.
  • I use different color pushpins to denote the difficulty of reaching the summit. Green is for class 1 or 2, yellow for class 3, and red for class 4. I didn’t factor in the higher difficulty of the summit block, as many of the peaks are class 2 or 3 to the base of the summit block, but the block itself is class 4.
  • In the peak description, I tried to generate links to trip reports on Climber.org, but not all of these will be valid. Usually this is because the peak in question has no Climber.org trip report, but a few are due to issues with reverse-engineering the “shortened name” algorithm used at that site when grouping trip reports.
  • The same thing is true for links to Secor’s “The High Sierra” book at Google Books. I have page numbers, but not all pages are available (as one would expect), and sometimes the peak name used to highlight entries on the page won’t match the name that Secor used.

Next, some notes on the KML format:

  • The on-line documentation is really good, especially the KML Reference provided by Google.
  • I ran into a few minor problems, where no error would be reported by Google Earth when loading my file, but problems in the data meant that I wouldn’t see the expected result. For example, I’d accidentally specified the <color> value as hex-ified RGB (e.g. “ffffff” for white) instead of ABGR (alpha/blue/green/red), which needs eight hex digits. Also I’d added an <IconStyle> element with an <href> child, but I needed to put the href inside of an <Icon> element. Minor things, but a bit frustrating to debug without any useful error being reported by Google Earth.
  • I wanted to use different built-in icons, but didn’t see a document listing all of these on Google. Eventually I found the list I needed in a Google forum post titled “Setting KML icon colors“.
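The two pitfalls above (color byte order and where <href> goes) are easy to illustrate. The sketch below is in Python rather than the Java used by peaks2kml, and it isn't taken from that program; the helper names, the style id, and the icon URL are placeholders.

```python
from xml.sax.saxutils import escape

def rgb_to_kml_color(rgb_hex, alpha=0xFF):
    """Convert web-style 'rrggbb' into KML's eight-digit 'aabbggrr' order."""
    rgb_hex = rgb_hex.lstrip("#").lower()
    if len(rgb_hex) != 6:
        raise ValueError("expected six hex digits, e.g. 'ff0000'")
    int(rgb_hex, 16)  # raises ValueError if the digits are not valid hex
    rr, gg, bb = rgb_hex[0:2], rgb_hex[2:4], rgb_hex[4:6]
    return f"{alpha:02x}{bb}{gg}{rr}"

def icon_style(style_id, rgb_hex, icon_url):
    """Build a <Style> whose <href> sits inside <Icon>, not directly in <IconStyle>."""
    return (
        f'<Style id="{escape(style_id)}">\n'
        f"  <IconStyle>\n"
        f"    <color>{rgb_to_kml_color(rgb_hex)}</color>\n"
        f"    <Icon>\n"
        f"      <href>{escape(icon_url)}</href>\n"
        f"    </Icon>\n"
        f"  </IconStyle>\n"
        f"</Style>"
    )

# Placeholder id and URL, only to show the resulting structure.
print(icon_style("class3", "ffff00", "http://example.com/pin.png"))
```

Swapping the bytes in one helper means the rest of the generator can keep thinking in familiar RGB values.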

I’ve posted source for the Java program used to generate the KML file. It’s located in my GitHub account, at the peaks2kml repository.

This Java program should have been trivial to write – basically convert from a text file dump of a database into the KML format. But I ran into one painful issue, which was converting from the NAD27 UTM locations into longitude/latitude. Seems like this bites everybody, and the lack of a universal, high quality Java package is frustrating.

I’m using the GeoTransform package, but I didn’t see a clean way to specify the source UTM datum as NAD27. I did figure out that the Clarke 1866 ellipsoid was the right one to use for conversion, and dumped out some results. I compared these with manual results from an excellent on-line UTM conversion page, and then used the delta (which appeared to be relatively constant) to adjust my results. Ugly, but close enough for a first cut.

And if I had to do it again, I’d probably use something like the KML beans (e.g. StyleType.java) from the Luzan project, and an XML package to convert the resulting object graph to a textual KML representation.


Yet another great git error message – expected sha/ref, got ‘

April 14, 2009

I’d been working away on the Bixo project, and pushing changes to GitHub without any problems.

Then I made the mistake of pulling in a new branch, versus creating the branch.

% git checkout origin cfetcher
% git pull

This merged the remote branch into my local master branch, with bizarre results. After a few attempts at trying to back it out, I blew away my local directory and just re-cloned the remote cfetcher branch, since that’s where I’d be working for the next few days. Unfortunately when I cloned it, I did:

% git clone git://github.com/emi/bixo.git

That created a clone using the GitHub “Public Clone URL”, not the “Your Clone URL”, which is git@github.com:emi/bixo.git. Oops.

Everything worked, though, until I wanted to push back some changes:

% git push
fatal: protocol error: expected sha/ref, got '
*********'

You can't push to git://github.com/user/repo.git
Use git@github.com:user/repo.git

*********'

Expected sha/ref? The error message did have all of the info I needed, just not in an obvious format. For example, a good message would have said:

You can't push to git://github.com/emi/bixo.git
Update the url for the "origin" remote in your .git/config file to use git@github.com:emi/bixo.git

Eventually the Supercharged git-daemon blog post at GitHub cleared things up for me. I edited the URL entry in my .git/config file, and all is (once again) well.

[remote "origin"]
    url = git@github.com:emi/bixo.git
    fetch = +refs/heads/*:refs/remotes/origin/*
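
The mapping between the two URL forms is mechanical, so here is a small sketch of it. The function name is made up, and it only handles the git://github.com/user/repo.git pattern quoted in the error message above.

```python
import re

def public_to_ssh_url(url):
    """Turn a read-only git://github.com/user/repo.git URL into the pushable form."""
    match = re.fullmatch(r"git://github\.com/([^/]+)/([^/]+?)(?:\.git)?/?", url)
    if not match:
        raise ValueError(f"not a GitHub public clone URL: {url}")
    user, repo = match.groups()
    return f"git@github.com:{user}/{repo}.git"

print(public_to_ssh_url("git://github.com/emi/bixo.git"))
```

Newer versions of git can also change the URL without hand-editing the config file, via git remote set-url origin <url>.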



Merging in a GitHub fork

April 14, 2009

I’m working on a new project in GitHub called Bixo, and recently had to merge in a fork from Chris Wensel. After poking around on the web a bit, I found some very useful information in Willem’s blog post on Remote branches in git.

There was one minor error, though, in the “Merging back a fork” section. After the “git remote add…” command, you have to do a “git fetch <remote>” command to first fetch the remote branches before you can successfully do a “git branch <branch name> <remote/branch>” command.

So in my case, this meant:

% cd git/github/bixo
% git remote add chris git://github.com/cwensel/bixo.git
% git fetch chris
% git branch chris-fork chris/master

And once that worked, I could merge from his branch to mine, and push back the changes.

Slowly but surely the git model accretes in my head.