Why I buy from Patagonia

May 27, 2009

Yes, it costs more for Patagonia. But the way they treat me as a customer makes me happy to pay a premium…as my latest experience shows.

I had a pair of Patagonia Gore-Tex pants from way-back-when. Worked fine, though my duct tape patch job ruined the clean lines – I’d accidentally stuck my ice axe through the pants and into my left leg, instead of the glacier, during a glissade off Rainier.

And then this past snowboarding season some seam sealing tape started coming off, so things began to get a bit wet at times. I sent the pants to Patagonia, with a note explaining that I’d also be happy to pay for a real repair job of my ice axe mishap.

Yesterday I got a Patagonia gift card in the mail, for $238.44. No idea how they calculated that amount, but I’m looking forward to buying a replacement pair of pants. And they’ve reaffirmed my belief that paying for quality gear winds up being cheaper in the end.


Performance problems with vertical/focused web crawling

May 19, 2009

Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:

Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.

There’s also a Jira issue (NUTCH-721) filed on the problem.

But in my experience using Nutch to do vertical/focused crawls, very slow fetch performance at the end of a crawl is a fundamental problem caused by not having enough unique domains. Once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), the fetch performance always drops rapidly, at least if your crawler is properly obeying robots.txt and the default rules for polite crawling.

Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was just doing using Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11-server cluster (1 master, 10 slaves) using the small EC2 instance. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4000 active fetch threads (10 slaves x 2 reducers x 200 threads), which is way more than I needed, but I was also testing memory usage (primarily kernel memory) of threads, so I’d cranked this way up.

I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “paid level” ontology as described in the IRLbot paper. So www.ibm.com, blogs.us.ibm.com, and ibm.com are all the same domain.
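To make that grouping concrete, here is a minimal Python sketch. It is not Bixo's code: the function name `paid_level_domain` and the tiny `MULTI_PART_SUFFIXES` sample are assumptions made up for the example, and a real crawler would want a complete public-suffix list rather than three hard-coded entries.

```python
# Hypothetical sketch: collapse hostnames to their "paid level" domain.
# MULTI_PART_SUFFIXES is a tiny illustrative sample, not a complete list.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def paid_level_domain(hostname):
    """Return the registrable part of a hostname, e.g. www.ibm.com -> ibm.com."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    # Keep three labels when the last two form a known multi-part suffix.
    if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


hosts = ["www.ibm.com", "blogs.us.ibm.com", "ibm.com"]
print({paid_level_domain(h) for h in hosts})  # all three collapse to one domain
```

Counting unique domains this way is what matters for politeness, since the per-domain delay applies to every host under the same paid-level domain.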

Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…

Fetch Performance

The key things to note from this graph are:

  • The 41K unique domains were down to 1700 after an hour, and then slowly continued to drop. This directly impacts the number of fetches that can politely execute at the same time. In fact there were only 240 parallel fetches (== 240 domains) after an hour, and 64 after three hours.
  • Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
  • And so it does, going from almost 9K/second (scaled to 10ths of second in the graph) after one hour down to 7K/second after four hours.

I think this represents a typical vertical/focused crawl, where a graph of the number of URLs/domain would show a very strong exponential decay. So once you’ve fetched the single URLs from a lot of different domains, you’re left with lots of URLs for a much smaller number of domains. And your performance will begin to stink.

The solution I’m using in Bixo is to specify the target fetch duration. From this, I can estimate the number of URLs per domain I might be able to get, and so I pre-prune the URLs put into each domain’s fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is a piece in the data processing pipeline that needs to play well with the rest (i.e. it doesn’t stall the entire process).
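Here is a rough Python sketch of the pre-pruning idea, under stated assumptions. It is not how Bixo actually implements it: the function names, the fixed per-domain crawl delay, the average fetch time, and the notion of a "deferred" set for a later crawl are all hypothetical choices made for illustration.

```python
import math


def max_urls_per_domain(target_duration_secs, crawl_delay_secs, avg_fetch_secs):
    """Estimate how many URLs one domain can politely serve in the time budget.

    Assumes one request at a time per domain, with a fixed delay between requests.
    """
    per_url_secs = crawl_delay_secs + avg_fetch_secs
    return max(1, math.floor(target_duration_secs / per_url_secs))


def prune_queues(urls_by_domain, limit):
    """Keep at most `limit` URLs per domain; return (kept, deferred) dicts."""
    kept, deferred = {}, {}
    for domain, urls in urls_by_domain.items():
        kept[domain] = urls[:limit]
        if len(urls) > limit:
            # Hypothetical: set the overflow aside for a later crawl.
            deferred[domain] = urls[limit:]
    return kept, deferred


# Made-up inputs: a one-hour budget, 30 second delay, 2 second average fetch.
limit = max_urls_per_domain(3600, 30, 2)
queues = {
    "a.com": ["http://a.com/page%d" % i for i in range(200)],
    "b.com": ["http://b.com/page%d" % i for i in range(3)],
}
kept, deferred = prune_queues(queues, limit)
print(limit, len(kept["a.com"]), len(kept["b.com"]), len(deferred.get("a.com", [])))
```

The point of pruning up front is that no single domain's queue can outlast the time budget, so the fetch phase ends when expected instead of trailing off into the long tail.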

Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.


Google Earth overlay for California Thirteeners peak list

May 11, 2009

My friend Schmed has compiled the definitive list of California peaks that are at least 13,000 ft in height. This labor of love has consumed countless hours, and now he’s adding to the effort by creating a database-driven GUI.

While incredibly detailed and accurate, his data set wasn’t all that useful to me when thinking about climbing trips. I’d still wind up hunched over my old “Guide to the John Muir Wilderness and Sequoia-Kings Canyon Wilderness” maps, with R. J. Secor’s “The High Sierra” book in hand, trying to figure out possible routes to interesting areas.

So I wrote a program to convert his data into a Google Earth-compatible KML file, which I could then use to visualize the peak list in glorious 3D. The resulting file has proven very useful, so I thought I’d share it via this blog post – and provide a bit of commentary regarding the program/process at the same time.

Google Earth peaks

First, notes about the file:

  • You can download it here. Then just open it from Google Earth.
  • I use different color pushpins to denote the difficulty of reaching the summit. Green is for class 1 or 2, yellow for class 3, and red for class 4. I didn’t factor in the higher difficulty of the summit block, as many of the peaks are class 2 or 3 to the base of the summit block, but the block itself is class 4.
  • In the peak description, I tried to generate links to trip reports on Climber.org, but not all of these will be valid. Usually this is because the peak in question has no Climber.org trip report, but a few are due to issues with reverse-engineering the “shortened name” algorithm used at that site when grouping trip reports.
  • The same thing is true for links to Secor’s “The High Sierra” book at Google Books. I have page numbers, but not all pages are available (as one would expect), and sometimes the peak name used to highlight entries on the page won’t match the name that Secor used.

Next, some notes on the KML format:

  • The on-line documentation is really good, especially the KML Reference provided by Google.
  • I ran into a few minor problems, where no error would be reported by Google Earth when loading my file, but problems in the data meant that I wouldn’t see the expected result. For example, I’d accidentally specified the <color> value as hex-ified RGB (e.g. “ffffff” for white) instead of ABGR (alpha/blue/green/red), which needs eight hex digits. Also I’d added an <IconStyle> element with an <href> child, but I needed to put the href inside of an <Icon> element. Minor things, but a bit frustrating to debug without any useful error being reported by Google Earth.
  • I wanted to use different built-in icons, but didn’t see a document listing all of these on Google. Eventually I found the list I needed in a Google forum post titled “Setting KML icon colors”.
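
The two gotchas above (the ABGR color ordering, and the href needing to live inside an <Icon> element) can be sketched in a few lines of Python. This is not the code from my Java program; the function names, the style id, and the example.com icon URL are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def rgb_to_kml_color(rgb_hex, alpha="ff"):
    """Convert 'rrggbb' into KML's eight-digit 'aabbggrr' ordering."""
    if len(rgb_hex) != 6:
        raise ValueError("expected six hex digits, e.g. 'ff0000'")
    rr, gg, bb = rgb_hex[0:2], rgb_hex[2:4], rgb_hex[4:6]
    return (alpha + bb + gg + rr).lower()


def icon_style(style_id, rgb_hex, icon_url):
    """Build Style > IconStyle > (color, Icon > href), the nesting described above."""
    style = ET.Element("Style", id=style_id)
    icon_style_el = ET.SubElement(style, "IconStyle")
    ET.SubElement(icon_style_el, "color").text = rgb_to_kml_color(rgb_hex)
    icon = ET.SubElement(icon_style_el, "Icon")  # href goes inside <Icon>
    ET.SubElement(icon, "href").text = icon_url
    return style


# Hypothetical style for class 3 peaks: opaque yellow pushpin, placeholder URL.
style = icon_style("class3", "ffff00", "http://example.com/pushpin.png")
print(ET.tostring(style, encoding="unicode"))
```

Building the elements with an XML library, rather than pasting strings together, at least guarantees well-formed output, though it still won't catch a color with the channels in the wrong order.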

I’ve posted source for the Java program used to generate the KML file. It’s located in my GitHub account, at the peaks2kml repository.

This Java program should have been trivial to write – basically convert from a text file dump of a database into the KML format. But I ran into one painful issue, which was converting from the NAD27 UTM locations into longitude/latitude. Seems like this bites everybody, and the lack of a universal, high quality Java package is frustrating.

I’m using the GeoTransform package, but I didn’t see a clean way to specify the source UTM datum as NAD27. I did figure out that the Clarke 1866 ellipsoid was the right one to use for conversion, and dumped out some results. I compared these with manual results from an excellent on-line UTM conversion page, and then used the delta (which appeared to be relatively constant) to adjust my results. Ugly, but close enough for a first cut.

And if I had to do it again, I’d probably use something like the KML beans (e.g. StyleType.java) from the Luzan project, and an XML package to convert the resulting object graph to a textual KML representation.


Yet another great git error message – expected sha/ref, got ‘

April 14, 2009

I’d been working away on the Bixo project, and pushing changes to GitHub without any problems.

Then I made the mistake of pulling in a new branch, versus creating the branch.

% git checkout origin cfetcher
% git pull

This merged the remote branch into my local master branch, with bizarre results. After a few attempts at trying to back it out, I blew away my local directory and just re-cloned the remote cfetcher branch, since that’s where I’d be working for the next few days. Unfortunately when I cloned it, I did:

% git clone git://github.com/emi/bixo.git

That created a clone using the GitHub “Public Clone URL”, not the “Your Clone URL”, which is git@github.com:emi/bixo.git. Oops.

Everything worked, though, until I wanted to push back some changes:

% git push
fatal: protocol error: expected sha/ref, got '
*********'

You can't push to git://github.com/user/repo.git
Use git@github.com:user/repo.git

*********'

Expected sha/ref? The error message actually had all of the info I needed, just not in a format that made it obvious. For example, a good message would have said:

You can't push to git://github.com/emi/bixo.git
Update the url for the "origin" remote in your .git/config file to use git@github.com:emi/bixo.git

Eventually the Supercharged git-daemon blog post at GitHub cleared things up for me. I edited the URL entry in my .git/config file, and all is (once again) well.

[remote "origin"]
    url = git@github.com:emi/bixo.git
    fetch = +refs/heads/*:refs/remotes/origin/*

Merging in a GitHub fork

April 14, 2009

I’m working on a new project in GitHub called Bixo, and recently had to merge in a fork from Chris Wensel. After poking around on the web a bit, I found some very useful information in Willem’s blog post on Remote branches in git.

There was one minor error, though, in the “Merging back a fork” section. After the “git remote add…” command, you have to do a “git fetch <remote>” command to first fetch the remote branches before you can successfully do a “git branch <branch name> <remote/branch>” command.

So in my case, this meant:

% cd git/github/bixo
% git remote add chris git://github.com/cwensel/bixo.git
% git fetch chris
% git branch chris-fork chris/master

And once that worked, I could merge from his branch to mine, and push back the changes.

Slowly but surely the git model accretes in my head.


phishing and nslookup versus dig

March 30, 2009

Just this morning I got a phishing email targeting USAA customers:

Dear USAA Customer,

We would like to inform you that we have released a new version of USAA Confirmation Form. This form is required to be completed by all USAA customers. Please use the button below in order to access the form:

Access USAA Confrmation Form

hank you,

USAA

And yes, it had the typical phishing spelling errors. But what was interesting to me was the link from the “Access USAA…” text, which went to http://www.usaa.com.1l1ji.com/<more stuff>. Just for grins, I did an nslookup on 1ji.com, and got back:

Non-authoritative answer:
Name:    1ji.com
Address: 216.239.36.21
Name:    1ji.com
Address: 216.239.32.21
Name:    1ji.com
Address: 216.239.34.21
Name:    1ji.com
Address: 216.239.38.21

All four of those IP addresses are for Google in Mountain View, at least according to IP2Location. But when I did a dig, I got:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12038
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;1j1.com.            IN    A

;; AUTHORITY SECTION:
1j1.com.        10800    IN    SOA    ns.dreger.de. admin.dreger.de. 2009021802 28800 7200 604800 86400

dreger.de is in Berlin, but there wasn’t much more information. Wish I understood better when there are differences between nslookup and dig. I googled a bit on “10800 IN SOA” but didn’t get any good hits.

Regardless, when I tried to visit the site to see what happened, Firefox conveniently blocked it:

FireFox Blocks Phishing Site



Slide + Snow = Simply Fun

January 4, 2009

Our friends Matt & Susan (plus family) came to visit Nevada City shortly after a rare “big snow” day.

Matt posted a video of the kids having fun at a local school.

Wish I knew how to embed Facebook videos in a WordPress blog posting.


IndirecTV

January 4, 2009

We decided to upgrade our TV options, while punting on Netflix, as we found ourselves more interested in nature shows and football than movies.

After painfully working through the various packages from Comcast, Dish and DirecTV, I went with DirecTV. Slightly better price and a bit better selection of shows we wanted, plus a $10/mo rebate offer.

What a mistake. We should have stuck with Comcast.

All of these services have problems, but we jumped out of the frying pan and into the fire.

  • The installation was painful. I had to crawl around under the house, and up in the attic, to help the installer figure out how to tap into our existing cable setup and replace several connectors. This had all worked fine with Comcast, but caused problems for DirecTV.
  • For the first month, every 30 minutes or so we’d get a message on the screen about having to download new software. If you weren’t fast enough with the remote, the firmware upgrade would start, and you’d be stuck staring at a status bar instead of seeing the game-winning field goal.
  • Since the beginning we’ve had problems with the screen randomly freezing. Audio would continue, but eventually the entire system would lock up. Only solution is to reset it and wait for the “Acquiring satellites” process to complete. Again, while missing something really interesting that we were paying to not see.
  • And the clincher is that I realized our bill didn’t show the $10 monthly discount. Looking through the paperwork, I found the “apply for your monthly bill credit” slip with instructions. But you had to do this within 60 days of activation, and we were over the limit.

Just for grins, I asked about cancellation. For only $336, we could bail on what my wife calls IndirecTV.

In an attempt to salvage the situation, I’ve requested another visit by the installer to see if they can fix the problem (all of the self-help steps failed), and I’m writing to their corporate office to ask for a more reasonable rebate policy. If you’re in the same situation, their address is:

DirecTV
PO Box 6550
Greenwood Village, CO 80155-6550

– Ken

PS – I thought I’d found a kindred soul in Om Malik, but his “And Now InDirecTV” blog post was about their kludgy video on demand service.


Her first bug report

November 7, 2008

I remember the first time my daughter said “dada”, which was also her first word (though I think my wife disagrees).

And there’s another important milestone for Jenna – her first bug report :)

She came up to my office, complaining about a problem on the Bella Sara web site. At the time I was nostril-deep in Jira issue hell, so I told her I would be happy to look at it, just as soon as she filed a bug report. After 25 years I’ve perfected many techniques for blowing off pesky users.

A few minutes later she tapped me on the shoulder and handed over her first bug report:

Jenna's first bug report

Nice, step-by-step instructions with details required to reproduce it. I think there’s a potential career in QA.


Direct from Nevada City – Captains of Crush!

October 26, 2008

My friend Ron recently sent me an article from his Stanford magazine on Randall Strossen and the “Captains of Crush gripper” that his company (IronMind Enterprises) sells.

Captains of Crush Gripper


No, I’m not thinking of a new career in strength training – the tie-in is that Randall & IronMind are based in my small town of Nevada City.