Why fetching web pages doesn’t map well to map-reduce

December 12, 2009

While working on Bixo, I spent a fair amount of time trying to figure out how to avoid the multi-threaded complexity and memory-usage issues of the FetcherBuffer class that I wound up writing.

The FetcherBuffer takes care of setting up queues of URLs to be politely fetched, with one queue for each unique <IP address>+<crawl delay> combination. Then a queue of these queues is managed by the FetcherQueueMgr, which works with a thread pool to provide groups of URLs to be fetched by an available thread, when enough time has gone by since the last request to be considered polite.
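Stripped of the threading, the shape of that bookkeeping looks roughly like this (a Python sketch with my own names and tuple shapes, not Bixo's actual API):

```python
import heapq
from collections import defaultdict

def build_queues(urls):
    """Group URLs into one queue per (IP address, crawl delay) key.

    urls: iterable of (ip, crawl_delay_secs, url) tuples.
    """
    queues = defaultdict(list)
    for ip, crawl_delay, url in urls:
        queues[(ip, crawl_delay)].append(url)
    return queues

def schedule(queues, start_time=0.0):
    """Return (fetch_time, url) pairs honoring per-server politeness."""
    # Min-heap of (next allowed fetch time, queue key).
    ready = [(start_time, key) for key in sorted(queues)]
    heapq.heapify(ready)
    plan = []
    while ready:
        t, key = heapq.heappop(ready)
        urls = queues[key]
        plan.append((t, urls.pop(0)))
        if urls:
            # Don't hit this server again until its crawl delay elapses.
            heapq.heappush(ready, (t + key[1], key))
    return plan
```

The real FetcherQueueMgr hands these queues to a thread pool instead of producing a static plan, but the politeness constraint is the same: a queue only becomes "ready" again after its crawl delay has passed.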

But this approach means that in the reducer phase of a map-reduce job you have to create these queues, and then wait in the completion phase of the operation until all of them have been processed. Running multiple threads creates complexity and memory issues due to native memory stack space requirements, and having in-memory queues of URLs creates additional memory pressure.

So why can’t we just use Hadoop’s map-reduce support to handle all of this for us?

The key problem is that map-reduce works well when each operation on a key/value pair is independent of every other key/value pair, and there are no external resource constraints.

But neither of those is true, especially during polite fetching.

For example, let’s say you implemented a mapper that created groups of 10 URLs, where each group was for the same server. You could easily process these groups in a reducer operation. This approach has two major problems, however.

First, you can’t control the interval between when groups for the same server would be processed. So you can wind up hitting a server to fetch URLs from a second group before enough time has expired to be considered polite, or worse yet you could have multiple threads hitting the same server at the same time.

Second, the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of cores (servers * cores/server). So on a 10-server cluster w/dual cores, you’d have 20 threads active. But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware & bandwidth. In Bixo, for example, a typical configuration is 300 threads/reducer.
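A back-of-envelope calculation makes the utilization gap concrete (the per-fetch wait and CPU times here are my own illustrative guesses, not measured numbers):

```python
def pages_per_second(threads, wait_s=1.0, cpu_s=0.01):
    # Each thread completes one fetch every (wait_s + cpu_s) seconds,
    # spending almost all of that time blocked on the remote server.
    return threads / (wait_s + cpu_s)

# One thread per core on a 10-server, dual-core cluster:
naive = pages_per_second(20)
# Thread-pool style: 20 reducers * 300 fetch threads each:
threaded = pages_per_second(20 * 300)
```

Ignoring bandwidth caps, the thread pool buys roughly a 300x throughput improvement over one fetch per reducer slot, which is why the FetcherBuffer complexity is worth it.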

Much of web crawling/mining maps well to a Hadoop map-reduce architecture, but fetching web pages unfortunately is a square peg in a round hole.


Using WordPress for web site but keeping mail separate

November 19, 2009

I use WordPress.com to host a number of web sites, and for simple stuff it’s great.

But I ran into a problem with keeping email separate, so I thought I’d share what I learned.

Here’s the background. I wanted to have http://bixolabs.com and http://www.bixolabs.com both wind up at the web site being hosted by WordPress.com. But I wanted to keep my email separate, versus using the GMail-only approach supported by WordPress.

According to WordPress documentation, you can’t do this. They say:

Changing the name servers will make any previously setup custom DNS records such as A, CNAME, or MX records stop working, and we do not have an option for you to create custom DNS records here. If you already have email configured on your domain, you must either switch to Custom Email with Google Apps or you can use a subdomain instead which doesn’t require changing the name servers.

This meant that I couldn’t just switch my name servers to WordPress, as they don’t support any DNS customization.

But if I keep my own DNS configuration, then all I can do is use a CNAME record to map a subdomain to WordPress. And you can’t treat “www” as a subdomain.

So my first attempt was to configure my DNS record as follows:

  • www -> [URL redirect] -> http://bixolabs.com
  • @ -> [CNAME] -> bixolabs.wordpress.com
  • @ -> [MX] -> <my hoster’s mail server IP address>

This worked pretty well. http://www.bixolabs.com got redirected to bixolabs.com, and bixolabs.com mapped to the bixolabs site at WordPress.com.

But the http://www.bixolabs.com redirect was a temp redirect (HTTP 302 status) not a permanent redirect (HTTP 301 status), so I was losing some SEO “juice” due to how Google and others interpret temp vs. perm redirects.

I fixed this by having my hoster set up their Apache server to do a permanent redirect, and changing the entry for www to point to the Apache server’s IP address.
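The difference between the two redirects boils down to one status line. Here’s a toy sketch of the response the hoster’s Apache server now sends (this is my illustration, not their actual config):

```python
def www_redirect(path, permanent=True):
    """Build the raw HTTP response that bounces www.bixolabs.com traffic."""
    # 301 tells search engines to credit bixolabs.com with the link;
    # a 302 says "temporary", so they keep treating the www variant
    # as a separate page.
    status = "301 Moved Permanently" if permanent else "302 Found"
    return (f"HTTP/1.1 {status}\r\n"
            f"Location: http://bixolabs.com{path}\r\n\r\n")
```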

But there was a bigger, hidden problem. Occasionally people would complain about getting email bounces when they tried to reply to one of my emails. The reply-to address in my email would be ken@bixolabs.com, but the To: field in their reply would be set to ken@lb.wordpress.com.

Eventually I figured out the problem. It’s technically not valid to have both a CNAME and an MX DNS entry for the same domain (or sub-domain, I assume). If a mail client does a lookup on the reply-to domain, bixolabs.com has the canonical address of “lb.wordpress.com”, since the CNAME entry overrides the MX entry.
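A toy resolver illustrates the rule (from RFC 1034) that bit me: a name with a CNAME record may have no other record types, so resolvers chase the CNAME before looking for MX. The zone data below is illustrative only, not my actual DNS setup:

```python
ZONE = {
    ("bixolabs.com", "CNAME"): "lb.wordpress.com",
    ("lb.wordpress.com", "MX"): "mx.wordpress.example",
    # The MX entry I *wanted* used never gets consulted, because the
    # CNAME rewrites the name first:
    ("bixolabs.com", "MX"): "mail.myhoster.example",
}

def lookup_mx(name):
    # Follow CNAMEs to the canonical name, then ask for its MX record.
    while (name, "CNAME") in ZONE:
        name = ZONE[(name, "CNAME")]
    return name, ZONE.get((name, "MX"))
```

So a reply addressed to ken@bixolabs.com gets rewritten to the canonical name, lb.wordpress.com – exactly the bounce symptom above.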

The fix for this involved three steps. First, I changed the MX entry in my DNS setup to use “mail”, not “@”. Then I changed my email client reply-to address to use mail.bixolabs.com, not just bixolabs.com. And finally, my hoster had to configure their mail server to recognize mail.bixolabs.com as a valid domain, not just bixolabs.com.



Wikipedia Love

November 16, 2009

Normally we wait until the end of the year to figure out our charitable donations, but I’ve been using Wikipedia so much over the past few days that I felt like I needed to donate today.

The WordPress Business Model

September 11, 2009

I think I finally understand how hosted WordPress makes money 🙂

I recently set up a web site for my dad’s consulting business, at KruglerEngineeringGroup.com. I used the WordPress hosted service, and a flexible, business-oriented theme called Vigilance.

But I needed to tweak the colors to get a solid background with white-on-blue text. It was pretty easy (using Firebug) to figure out the CSS changes required; I could edit these in the WordPress Custom CSS form, and I got the look I wanted – so the hook was set. Now I just needed to pay for the $14.97/year “upgrade” to be able to save and use the custom CSS.

Which I gladly did, since it would be way more expensive for me in time and hassle to try to set this up in my own WordPress environment.

Step 2 was connecting his existing KruglerEngineeringGroup.com domain to the WordPress site. A few clicks on the WordPress.com site, another modest yearly payment of $9.97 (where do they get these amounts?), and we were almost all set. The one minor difficulty was in handling the “www” subdomain. WordPress says that if you want this to work, you need to change the domain name servers to use their name servers. But the current domain needs to use a specific email server (MX record).

So the solution was to create two DNS entries in the current name server config. One was the standard WordPress entry for subdomains, where you create a CNAME record that maps “@” to kruglerengineeringgroup.wordpress.com. The second entry mapped “www” as a URL redirect to http://kruglerengineeringgroup.com. Once that propagated, everything worked as planned. A few hours of my time, and $24.94/year to WordPress.


Yet another great git error message – expected sha/ref, got ‘

April 14, 2009

I’d been working away on the Bixo project, and pushing changes to GitHub without any problems.

Then I made the mistake of pulling in a new branch, versus creating the branch.

% git checkout origin cfetcher
% git pull

This merged the remote branch into my local master branch, with bizarre results. After a few attempts at trying to back it out, I blew away my local directory and just re-cloned the remote cfetcher branch, since that’s where I’d be working for the next few days. Unfortunately when I cloned it, I did:

% git clone git://github.com/emi/bixo.git

That created a clone using the GitHub “Public Clone URL”, not the “Your Clone URL”, which is git@github.com:emi/bixo.git. Oops.

Everything worked, though, until I wanted to push back some changes:

% git push
fatal: protocol error: expected sha/ref, got '
*********'

You can't push to git://github.com/user/repo.git
Use git@github.com:user/repo.git

*********'

Expected sha/ref? The error message actually had all of the info I needed, just not in an obvious format. For example, a good message would have said:

You can't push to git://github.com/emi/bixo.git
Update the url for the "origin" remote in your .git/config file to use git@github.com:emi/bixo.git

Eventually the Supercharged git-daemon blog post at GitHub cleared things up for me. I edited the URL entry in my .git/config file, and all is (once again) well.

[remote "origin"]
    url = git@github.com:emi/bixo.git
    fetch = +refs/heads/*:refs/remotes/origin/*

Merging in a GitHub fork

April 14, 2009

I’m working on a new project in GitHub called Bixo, and recently had to merge in a fork from Chris Wensel. After poking around on the web a bit, I found some very useful information in Willem’s blog post on Remote branches in git.

There was one minor error, though, in the “Merging back a fork” section. After the “git remote add…” command, you have to do a “git fetch <remote>” command to first fetch the remote branches before you can successfully do a “git branch <branch name> <remote/branch>” command.

So in my case, this meant:

% cd git/github/bixo
% git remote add chris git://github.com/cwensel/bixo.git
% git fetch chris
% git branch chris-fork chris/master

And once that worked, I could merge from his branch to mine, and push back the changes.

Slowly but surely the git model accretes in my head.


phishing and nslookup versus dig

March 30, 2009

Just this morning I got a phishing email targeting USAA customers:

Dear USAA Customer,

We would like to inform you that we have released a new version of USAA Confirmation Form. This form is required to be completed by all USAA customers. Please use the button below in order to access the form:

Access USAA Confrmation Form

hank you,

USAA

And yes, it had the typical phishing spelling errors. But what was interesting to me was the link from the “Access USAA…” text, which went to http://www.usaa.com.1l1ji.com/<more stuff>. Just for grins, I did an nslookup on 1ji.com, and got back:

Non-authoritative answer:
Name:    1ji.com
Address: 216.239.36.21
Name:    1ji.com
Address: 216.239.32.21
Name:    1ji.com
Address: 216.239.34.21
Name:    1ji.com
Address: 216.239.38.21

All four of those IP addresses are for Google in Mountain View, at least according to IP2Location. But when I did a dig, I got:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12038
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;1j1.com.            IN    A

;; AUTHORITY SECTION:
1j1.com.        10800    IN    SOA    ns.dreger.de. admin.dreger.de. 2009021802 28800 7200 604800 86400

dreger.de is in Berlin, but there wasn’t much more information. I wish I understood better when there are differences between nslookup and dig. I googled a bit on “10800 IN SOA” but didn’t get any good hits.

Regardless, when I tried to visit the site to see what happened, Firefox conveniently blocked it:

FireFox Blocks Phishing Site



Fun with trebuchets

August 23, 2008

My friends Ron & Eleanor were coming to visit us on the 4th of July, and I thought that building a trebuchet would be a fun activity for Jenna & their two boys.

There are lots of resources on the web for trebuchet designs, but I wanted something that we could quickly whip up in a few hours, yet would be more than a toy model. After a critical video iChat with my dad, we had a list of materials and were off to the hardware/lumber store.

I used the MacTreb program (downloaded from http://www.algobeautytreb.com/index.html, but seems to be off-line right now) to simulate the design. Though in the end, I think the light weight of the tennis balls we were using meant that air resistance played a major role in distance, thus the range calculated (370ft) was much greater than what we achieved (about 250ft).
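To sanity-check the air-resistance theory, here’s a crude flight simulation (the ball parameters are standard tennis-ball figures; the 33 m/s launch speed is my guess, back-solved from the 370ft vacuum range; none of this comes from MacTreb):

```python
import math

def range_m(v0, angle_deg, drag=True, dt=0.001):
    """Downrange distance of a tennis ball, via simple Euler integration."""
    m, d, cd, rho, g = 0.057, 0.067, 0.5, 1.225, 9.81
    # Quadratic drag coefficient per unit mass (zero = vacuum flight).
    k = 0.5 * rho * cd * math.pi * (d / 2) ** 2 / m if drag else 0.0
    vx = v0 * math.cos(math.radians(angle_deg))
    vy = v0 * math.sin(math.radians(angle_deg))
    x = y = 0.0
    while y >= 0.0:
        v = math.hypot(vx, vy)
        vx -= k * v * vx * dt
        vy -= (g + k * v * vy) * dt
        x += vx * dt
        y += vy * dt
    return x

vacuum = range_m(33.0, 45, drag=False)   # ~111 m, close to MacTreb's 370ft
dragged = range_m(33.0, 45, drag=True)   # noticeably shorter
```

With drag turned on, the simulated range drops substantially, consistent with the gap between the predicted 370ft and the ~250ft we actually saw.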

Here’s the end result:

Ron & Finished Trebuchet

We used a 20lb dumbbell as the counter weight. The throwing arm (8ft x 1″ x 4″ pine) has 2″x4″ blocks attached on either side of the pivot point, then we screwed 3/8″ lag bolts in from either side. These rotate in similar sized holes drilled in the vertical support boards (5ft x 1″ x 6″ pine).

The sling was made out of an old plastic shower curtain, using a pattern I found on-line…but I can’t locate it right now.

Finally, there’s the always-fun release mechanism – both for the throwing arm, and the sling. I put two hook screws in, and that worked well enough as a way of keeping the arm held down while I fiddled with the sling.

The sling itself was permanently attached on one end, and had a link from a chain that slid off an angled metal rod. Adjusting the angle of the rod let me tune when the ball would actually come free from the sling, and thus the angle of release.

Throwing arm & sling release mechanisms

In the end it all worked, though my wife was less than impressed – she had visions of us launching bowling balls over the neighbor’s roof. Things to change for version two:

  • Throw a denser object – a lacrosse ball feels about right. The tennis ball just slowed down too fast once it came out of the sling.
  • Make the sling pouch size a bit bigger. We sometimes had launch failures when the tennis ball would pop out just as the sling started moving.
  • Use thinner & lighter cord for the sling. The rope we used seemed a bit out-of-scale with the weight of the object we were throwing.
  • Angle the support boards in from a wider base. The dumbbell almost didn’t fit, and sometimes hit the boards during a launch. And we wanted to add more weight, but there wasn’t a good way to tie on a second hand weight without widening the frame.
  • Add bracing from the back of the main base board to the upright boards. The trebuchet would rock back and forth without this bracing.

But all in all it was a fun activity for a 4th of July weekend – highly recommended!


CodeRage II – sometimes things don’t go well

December 4, 2007

CodeRage is the name of the virtual developer conference sponsored by CodeGear, the spin-out from Borland that handles all of their tools like Delphi and JBuilder.

This year I gave a talk titled “Impact Analysis for the Rest of Us”. It was about how impact analysis is something every developer does, every day. And yet current impact analysis tools focus on the architect level, the Big Bang type of change.

Anyway, I thought it was an interesting and useful topic, though perhaps I’m a bit blinded by my involvement with Krugle. Apparently it didn’t rate so high with the CodeGear crowd – I got assigned the first slot (7:15am) on the last day (Friday). And then at the end of my presentation there was some confusion about how to patch me in via Skype, even though I’d worked through the issues with the conference team during an earlier training session.

Which means the 3 people who watched the video I’d slaved over (see my post about much pain & suffering while creating it) weren’t able to ask questions, even if they’d had any.

At least I’ve got a video that might be useful in the future, so I’m going to focus on the positives.


Video Editing Round 3 – InterWise Strikes Back

November 19, 2007

I just finished testing my CodeRage II presentation – or testing it as best I could, since InterWise doesn’t work with a Mac.

But somebody else on the training chat (Bob Swart, aka “Dr. Bob“) helpfully took a screenshot so I could see what he meant by “funny colors” in the video:

Mangled Video from InterWise

Nice, huh? Best I can tell, InterWise is down-sampling the video from thousands of colors (16-bit) to 256 colors (8-bit indexed), and that conversion isn’t working as well as it ought to. Bummer. But at least it’s still legible, and the audio seems fine. So no ugly editing required.