Git and unreferenced blobs and Stack Overflow

December 16, 2009

I ran into a problem yesterday, while trying to prune the size of the Bixo repo at GitHub (450MB, ouch).

I deleted the release branch, first on GitHub and then locally. That branch contained a bunch of big binary blobs (release distribution jars). But even after this work, I still had 250MB+ in both my local and GitHub repositories.

Following some useful steps I found on Stack Overflow, I could narrow the problem down to a few unreferenced blobs. By “unreferenced” I mean these were blobs with SHA1s that could not be located anywhere in the git tree/history by the various scripts I found on Stack Overflow.

I posted a question about this to Stack Overflow, and got some very useful answers, though nothing that directly solved the problem. But it turns out that a fresh clone from GitHub is much smaller, and these dangling blobs are gone. So I think it’s a git bug, where these blobs get left around locally but are correctly cleared from the remote repo.
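For anyone chasing the same thing, here is a rough sketch of one way to list such blobs. It is not one of the Stack Overflow scripts; it assumes the output of `git fsck --unreachable --no-reflogs` has one `<status> <type> <sha>` line per object, and the function names are made up for illustration.

```python
import subprocess

def parse_fsck(output):
    """Return SHA-1s of blobs that git fsck reports as unreachable or dangling."""
    shas = []
    for line in output.splitlines():
        parts = line.split()
        # Expected line shape: "<status> <type> <sha>", e.g. "unreachable blob 1a2b..."
        if len(parts) == 3 and parts[0] in ("unreachable", "dangling") and parts[1] == "blob":
            shas.append(parts[2])
    # Drop duplicates while keeping the order in which fsck listed them.
    return list(dict.fromkeys(shas))

def unreferenced_blobs(repo_dir="."):
    """Run git fsck in repo_dir and return the unreferenced blob SHA-1s."""
    result = subprocess.run(
        # --no-reflogs: do not treat objects that are only in the reflog as reachable.
        ["git", "fsck", "--unreachable", "--no-reflogs"],
        cwd=repo_dir, capture_output=True, text=True, check=False)
    return parse_fsck(result.stdout)
```

Calling `unreferenced_blobs()` from inside a working copy returns a list of SHA-1 strings; what to do with them afterwards is a separate question.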

But I ran into a new problem today with my local Bixo repo, where I couldn’t push changes. I’d get this output from my “git push” command:

Counting objects: 92, done.
Delta compression using 2 threads.
Compressing objects: 100% (53/53), done.
Writing objects: 100% (57/57), 11.50 KiB, done.
Total 57 (delta 28), reused 0 (delta 0)
error: insufficient permission for adding an object to repository database ./objects

fatal: failed to write object
error: unpack-objects exited with error code 128
error: unpack failed: unpack-objects abnormal exit
To git@github.com:bixo/bixo.git
 ! [remote rejected] master -> master (n/a (unpacker error))
error: failed to push some refs to 'git@github.com:bixo/bixo.git'

No solutions came up while searching, but the problem doesn’t exist for a fresh clone, so I’m manually migrating my changes over to the fresh copy, and then I’ll happily delete my apparently messed up older local git repo and move on to more productive uses of my time.

[UPDATE: The problem does actually exist in a fresh clone, but only for the second push. Eventually GitHub support resolved the issue by fixing permissions of some files on their side of the fence. Apparently things got "messed up" during the fork from the original EMI/bixo repo.]


Why fetching web pages doesn’t map well to map-reduce

December 12, 2009

While working on Bixo, I spent a fair amount of time trying to figure out how to avoid the multi-threaded complexity and memory-usage issues of the FetcherBuffer class that I wound up writing.

The FetcherBuffer takes care of setting up queues of URLs to be politely fetched, with one queue for each unique <IP address>+<crawl delay> combination. Then a queue of these queues is managed by the FetcherQueueMgr, which works with a thread pool to provide groups of URLs to be fetched by an available thread, when enough time has gone by since the last request to be considered polite.

But this approach means that in the reducer phase of a map-reduce job you have to create these queues, and then wait in the completion phase of the operation until all of them have been processed. Running multiple threads creates complexity and memory issues due to native memory stack space requirements, and having in-memory queues of URLs creates additional memory pressure.
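To make the queue-of-queues idea concrete, here is a minimal single-threaded sketch. It is written in Python for brevity and is not the actual FetcherBuffer or FetcherQueueMgr code; the class name, method names, and the use of a heap are all assumptions made for illustration.

```python
import heapq
from collections import deque

class PoliteQueues:
    """One FIFO of URLs per (ip, crawl_delay) key, plus a heap of keys
    ordered by the earliest time each server may be hit again."""

    def __init__(self):
        self._queues = {}    # key -> deque of pending URLs
        self._next_ok = {}   # key -> earliest time the next request is polite
        self._ready = []     # heap of (time, key), one entry per non-empty queue

    def add(self, ip, crawl_delay, url, now=0.0):
        key = (ip, crawl_delay)
        queue = self._queues.setdefault(key, deque())
        if not queue:
            # Key is not in the heap yet; schedule it, honoring any earlier fetch.
            heapq.heappush(self._ready, (max(now, self._next_ok.get(key, now)), key))
        queue.append(url)

    def next_url(self, now):
        """Return a URL that can be fetched politely at time `now`, or None."""
        if not self._ready or self._ready[0][0] > now:
            return None
        _, key = heapq.heappop(self._ready)
        queue = self._queues[key]
        url = queue.popleft()
        # The same server may not be hit again until crawl_delay seconds from now.
        self._next_ok[key] = now + key[1]
        if queue:
            heapq.heappush(self._ready, (self._next_ok[key], key))
        return url
```

In a real fetcher each worker thread would call something like `next_url()` in a loop, which is exactly where the threading and memory issues described above come from: every pending URL sits in one of these in-memory queues until its turn comes.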

So why can’t we just use Hadoop’s map-reduce support to handle all of this for us?

The key problem is that MR works well when each operation on a key/value pair is independent of any other key/value, and there are no external resource constraints.

But neither of those is true for fetching web pages, especially polite fetching.

For example, let’s say you implemented a mapper that created groups of 10 URLs, where each group was for the same server. You could easily process these groups in a reducer operation. This approach has two major problems, however.

First, you can’t control the interval between when groups for the same server would be processed. So you can wind up hitting a server to fetch URLs from a second group before enough time has expired to be considered polite, or worse yet you could have multiple threads hitting the same server at the same time.

Second, the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of cores (servers * cores/server). So on a 10 server cluster with dual cores, you’d have 20 threads active. But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware & bandwidth. In Bixo, for example, a typical configuration is 300 threads/reducer.
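The arithmetic behind that second point can be spelled out as a back-of-the-envelope calculation. The cluster size and the 300 threads/reducer come from the numbers above; the one second per fetch is purely an assumed figure for illustration, and the result is an upper bound that ignores politeness delays and bandwidth limits.

```python
servers = 10
cores_per_server = 2
reducers = servers * cores_per_server        # one reducer per core

threads_per_reducer = 300                    # typical Bixo configuration
assumed_seconds_per_fetch = 1.0              # ASSUMPTION: mostly time spent waiting on the server

def max_fetches_per_second(concurrent_fetches, seconds_per_fetch):
    """Upper bound on fetch rate when every thread is always busy."""
    return concurrent_fetches / seconds_per_fetch

# One fetch at a time per reducer, as in the pure map-reduce approach.
plain_mr_rate = max_fetches_per_second(reducers, assumed_seconds_per_fetch)

# Many fetch threads inside each reducer.
threaded_concurrency = reducers * threads_per_reducer
threaded_rate = max_fetches_per_second(threaded_concurrency, assumed_seconds_per_fetch)
```

Whatever the real per-fetch latency is, the ratio between the two rates stays at 300x, which is the gap that running threads inside each reducer is meant to close.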

Much of web crawling/mining maps well to a Hadoop map-reduce architecture, but fetching web pages unfortunately is a square peg in a round hole.


Using WordPress for web site but keeping mail separate

November 19, 2009

I use WordPress.com to host a number of web sites, and for simple stuff it’s great.

But I ran into a problem with keeping email separate, so I thought I’d share what I learned.

Here’s the background. I wanted to have http://bixolabs.com and http://www.bixolabs.com both wind up at the web site being hosted by WordPress.com. But I wanted to keep my email separate, versus using the GMail-only approach supported by WordPress.

According to WordPress documentation, you can’t do this. They say:

Changing the name servers will make any previously setup custom DNS records such as A, CNAME, or MX records stop working, and we do not have an option for you to create custom DNS records here. If you already have email configured on your domain, you must either switch to Custom Email with Google Apps or you can use a subdomain instead which doesn’t require changing the name servers.

This meant that I couldn’t just change my name server to WordPress, as they don’t support any customization.

But if I keep my own DNS configuration, then all I can do is use a CNAME record to map a subdomain to WordPress. And you can’t treat “www” as a subdomain.

So my first attempt was to configure my DNS records as follows:

  • www -> [URL redirect] -> http://bixolabs.com
  • @ -> [CNAME] -> bixolabs.wordpress.com
  • @ -> [MX] -> <my hoster’s mail server IP address>

This worked pretty well. www.bixolabs.com got redirected to bixolabs.com, and bixolabs.com mapped to the bixolabs site at WordPress.com.

But the www.bixolabs.com redirect was a temp redirect (HTTP 302 status) not a permanent redirect (HTTP 301 status), so I was losing some SEO “juice” due to how Google and others interpret temp vs. perm redirects.

I fixed this by having my hoster set up their Apache server to do a permanent redirect, and changing the entry for www to point to the Apache server’s IP address.

But there was a bigger, hidden problem. Occasionally people would complain about getting email bounces, when they tried to reply to one of my emails. The reply-to address in my email would be ken@bixolabs.com, but the To: field in their reply would be set to ken@lb.wordpress.com.

Eventually I figured out the problem. It’s technically not valid to have both a CNAME and an MX DNS entry for the same domain (or sub-domain, I assume). If a mail client does a lookup on the reply-to domain, bixolabs.com has the canonical address of “lb.wordpress.com”, since the CNAME entry overrides the MX entry.
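A quick way to catch this kind of mistake is to scan the record list for any name that has a CNAME alongside other record types, which is the combination RFC 1034 rules out. The sketch below is only illustrative: the function name and the `(name, type, value)` tuple layout are assumptions, and `mailhost.example.com` is a placeholder rather than my hoster's real mail server.

```python
from collections import defaultdict

def cname_conflicts(records):
    """Map each name that has a CNAME plus other record types to those other types."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype.upper())
    return {
        name: sorted(types - {"CNAME"})
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    }

# The problematic setup: CNAME and MX both on the bare domain ("@").
broken = [
    ("@", "CNAME", "bixolabs.wordpress.com"),
    ("@", "MX", "mailhost.example.com"),     # placeholder mail host
]

# A setup with the MX record moved to its own "mail" name.
fixed = [
    ("@", "CNAME", "bixolabs.wordpress.com"),
    ("mail", "MX", "mailhost.example.com"),  # placeholder mail host
]
```

Here `cname_conflicts(broken)` flags `"@"` as having an MX record next to its CNAME, while `cname_conflicts(fixed)` comes back empty.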

The fix for this involved three steps. First, I changed the MX entry in my DNS setup to use “mail”, not “@”. Then I changed my email client reply-to address to use mail.bixolabs.com, not just bixolabs.com. And finally, my hoster had to configure their mail server to recognize mail.bixolabs.com as a valid domain, not just bixolabs.com.

 


Wikipedia Love

November 16, 2009

Normally we wait until the end of the year to figure out our charitable donations, but I’ve been using Wikipedia so much over the past few days that I felt like I needed to donate today.

Bixolabs goes public

November 2, 2009

I’ve been working on an elastic web mining platform for a few months now, and it was finally time to go public with at least the current state of the union.

I gave a talk at the ACM Data Mining Unconference on Sunday, where I also announced the Public Terabyte Dataset project, so the timing was perfect.

If you want to know what’s been keeping me busy, and looks to be part of my future, check out http://bixolabs.com.

 


Fixing Firefox default monitor

October 26, 2009

I’m running Firefox 3.0.14 on Mac OS X 10.5.

I’ve got a MacBook laptop and a 24″ LCD display as my normal configuration, though sometimes I’m just using the laptop.

Whenever I open a new browser window, it defaults to the laptop display, not the big LCD, even though that’s my main screen.

I searched the forums, and didn’t find any good solution, so here’s what worked for me:

  1. Quit Firefox
  2. Locate the localstore.rdf file in your Firefox profile directory. This will be in the ~/Library/Application Support/Firefox/Profiles/<random string>.default/ directory.
  3. Open it with your favorite text editor.
  4. Find the RDF:Description element whose RDF:about attribute is set to “chrome://browser/content/browser.xul#main-window”
  5. Set the screenX and screenY values to 0.
  6. Save the file.
  7. Restart Firefox

In my case, for example, the prior contents of this file were:

<RDF:Description RDF:about="chrome://browser/content/browser.xul#main-window"
    height="778"
    screenX="-1273"
    screenY="401"
    width="1276"
    sizemode="maximized" />

By setting screenX="0" and screenY="0", I was able to fix my problem.
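If you would rather script the edit than do it by hand, something along these lines should work. It is a sketch under assumptions: it treats the file as plain text, relies on the attribute layout shown above, and the function name is made up. Quit Firefox first and keep a backup copy of localstore.rdf.

```python
import re

# Identifies the element that stores the main browser window's geometry.
MAIN_WINDOW = 'RDF:about="chrome://browser/content/browser.xul#main-window"'

def reset_window_position(rdf_text):
    """Set screenX and screenY to 0 inside the main-window RDF:Description element."""
    def fix_element(match):
        element = match.group(0)
        if MAIN_WINDOW not in element:
            return element  # leave every other element untouched
        return re.sub(r'(screen[XY])="-?\d+"', r'\1="0"', element)

    # [^>]* also spans newlines, so multi-line elements are matched as a whole.
    return re.sub(r'<RDF:Description\b[^>]*>', fix_element, rdf_text)
```

To use it, read localstore.rdf from your profile directory into a string, pass it through `reset_window_position()`, and write the result back to the same file before restarting Firefox.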


The WordPress Business Model

September 11, 2009

I think I finally understand how hosted WordPress makes money :)

I recently set up a web site for my dad’s consulting business, at KruglerEngineeringGroup.com. I used the WordPress hosted service, and a flexible, business-oriented theme called Vigilance.

But I needed to tweak the colors to get a solid background, with white-on-blue text. It was pretty easy (using Firebug) to figure out the CSS changes required, and I could edit these in the WordPress Custom CSS form, and I got the look I wanted – so the hook was set. Then I just needed to pay for the $14.97/year “upgrade” to be able to save and use the custom CSS.

Which I gladly did, since it would be way more expensive for me in time and hassle to try to set this up in my own WordPress environment.

Step 2 was connecting his existing KruglerEngineeringGroup.com domain to the WordPress site. A few clicks on the WordPress.com site, another modest yearly payment of $9.97 (where do they get these amounts?), and we were almost all set. The one minor difficulty was in handling the “www” subdomain. WordPress says that if you want this to work, you need to change the domain name servers to use their name servers. But the current domain needs to use a specific email server (MX record).

So the solution was to create two DNS entries in the current name server config. One was the standard WordPress entry for subdomains, where you create a CNAME record that maps “@” to kruglerengineeringgroup.wordpress.com. The second entry mapped “www” as a URL redirect to http://kruglerengineeringgroup.com. Once that propagated, everything worked as planned. A few hours of my time, and $24.94/year to WordPress.


Local mines near Nevada City

August 25, 2009

While searching (in vain) for a cool new domain name, I stumbled upon the CaliforniaMaps.org web site. As you can see from the snapshot below, there are lots of local places to go looking for empty mine shafts to fall down:

Nevada County Mines

Check out the interactive version here.


A Man and his EuroVan Camper

August 8, 2009

After 10 years of on-and-off discussion, we finally took the plunge and bought a 1997 EuroVan Camper – or 97EVC for members of the club.

EuroVan Camper at Westport

Being able to pull in, pop the top, and kick back was huge.

Though we’re still working out the kinks in our travel setup and procedures – as a fellow EVCer said, it’s like living in a sailboat. You have to plan things a few moves in advance, so you don’t wind up with the bed extended and the (blocked) cabinet containing your toothbrush.

And thank goodness for the EVC Yahoo group – without their help, I would have been totally stuck.

We’ve got a typical list of things to fix, buy, and figure out before the next big trip:

  • The dreaded Norcold refrigerator stopped working on propane. And the “burner on” light fell out (again).
  • Cruise control stopped working.
  • There’s a small coolant leak.
  • The driver’s side windshield wiper fluid doesn’t squirt.
  • There’s a new crack in the windshield.
  • The headlight low beams are way too low.
  • The rear (hatch) door sometimes doesn’t unlock.

But in spite of the problems, we had a great time. And we would have never spent two wonderful days camping in the redwoods at Humboldt Redwoods State Park, or seen this amazing memorial to the town of Pepperwood, which was wiped out in the 1964 Eel River flood.

Memorial for town of Pepperwood

The caption says:

To Pepperwood

And It’s Loved Ones

Gone but not forgotten

Presented by

Fortuna Chamber of Commerce

Inquiries to the Fortuna CoC for background and grammar checks have gone unanswered.


Why I buy from Patagonia

May 27, 2009

Yes, it costs more for Patagonia. But the way they treat me as a customer makes me happy to pay a premium…as my latest experience shows.

I had a pair of Patagonia Gore-Tex pants from way-back-when. They worked fine, though my duct tape patch job ruined the clean lines – I’d accidentally stuck my ice axe through the pants and into my left leg, instead of the glacier, during a glissade off Rainier.

And then this past snowboarding season some seam sealing tape started coming off, so things began to get a bit wet at times. I sent the pants to Patagonia, with a note explaining that I’d also be happy to pay for a real repair job of my ice axe mishap.

Yesterday I got a Patagonia gift card in the mail, for $238.44. No idea how they calculated that amount, but I’m looking forward to buying a replacement pair of pants. And they’ve reaffirmed my belief that paying for quality gear winds up being cheaper in the end.