Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:
Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.
There’s also a Jira issue (NUTCH-721) filed on the problem.
But in my experience using Nutch to do vertical/focused crawls, this problem of having very slow fetch performance at the end of a crawl is a fundamental problem caused by not enough unique domains. If a crawler is polite, then once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), the fetch performance always drops rapidly, at least if your crawler is properly obeying robots.txt and the default rules for polite crawling.
Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was just doing using Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11 server cluster (1 master, 10 slaves) using the small EC2 instance. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4000 active fetch threads, which is way more than I needed, but I was also testing memory usage (primarily kernel memory) of threads, so I’d cranked this way up.
I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “paid level” ontology as described in the IRLbot paper. So http://www.ibm.com, blogs.us.ibm.com, and ibm.com are all the same domain.
Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…
The key things to note from this graph are:
- The 41K unique domains were down to 1700 after an hour, and then slowly continued to drop. This directly impacts the number of simultaneous fetches that can politely execute at the same time. In fact there were only 240 parallel fetches (== 240 domains) after an hour, and 64 after three hours.
- Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
- And so it does, going from almost 9K/second (scaled to 10ths of second in the graph) after one hour down to 7K/second after four hours.
I think this represents a typical vertical/focused crawl, where a graph of the number of URLs/domain would show a very strong exponential decay. So once you’ve fetched the single URLs from a lot of different domains, you’re left with lots of URLs for a much smaller number of domains. And your performance will begin to stink.
The solution I’m using in Bixo is to specify the target fetch duration. From this, I can estimate the number of URLs per domain I might be able to get, and so I pre-prune the URLs put into each domain’s fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is a piece in the data processing pipeline that needs to play well (ie doesn’t stall the entire process).
Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.