Over at the Nutch mailing list, there are regular posts complaining about the performance of the new queue-based fetcher (aka Fetcher2) that became the default fetcher when Nutch 1.0 was released. For example:
Not sure if that problem is solved, I have it and reported it in a previous thread. Extremely fast fetch at the beginning and damn slow fetches after a while.
There’s also a Jira issue (NUTCH-721) filed on the problem.
But in my experience using Nutch to do vertical/focused crawls, this problem of having very slow fetch performance at the end of a crawl is a fundamental problem caused by not enough unique domains. If a crawler is polite, then once the number of unique domains drops significantly (because you’ve fetched all of the URLs for most of the domains), the fetch performance always drops rapidly, at least if your crawler is properly obeying robots.txt and the default rules for polite crawling.
Just for grins, I tracked a number of metrics at the tail end of a vertical crawl I was just doing using Bixo – that’s the vertical crawler toolkit I’ve been working on for the past two months. The system configuration (in Amazon’s EC2) is an 11 server cluster (1 master, 10 slaves) using the small EC2 instance. I run 2 reducers per server, with a maximum of 200 fetcher threads per reducer. So the theoretical maximum is 4000 active fetch threads, which is way more than I needed, but I was also testing memory usage (primarily kernel memory) of threads, so I’d cranked this way up.
I started out with 1,264,539 URLs from 41,978 unique domains, where I classify domains using the “paid level” ontology as described in the IRLbot paper. So http://www.ibm.com, blogs.us.ibm.com, and ibm.com are all the same domain.
Here’s the performance graph after one hour, which is when the crawl seemed to enter the “long tail” fetch phase…
The key things to note from this graph are:
- The 41K unique domains were down to 1700 after an hour, and then slowly continued to drop. This directly impacts the number of simultaneous fetches that can politely execute at the same time. In fact there were only 240 parallel fetches (== 240 domains) after an hour, and 64 after three hours.
- Conversely, the average number of URLs per domain climbs steadily, which means the future fetch rate will continue to drop.
- And so it does, going from almost 9K/second (scaled to 10ths of second in the graph) after one hour down to 7K/second after four hours.
I think this represents a typical vertical/focused crawl, where a graph of the number of URLs/domain would show a very strong exponential decay. So once you’ve fetched the single URLs from a lot of different domains, you’re left with lots of URLs for a much smaller number of domains. And your performance will begin to stink.
The solution I’m using in Bixo is to specify the target fetch duration. From this, I can estimate the number of URLs per domain I might be able to get, and so I pre-prune the URLs put into each domain’s fetch queue. This works well for the type of data processing workflow that the current commercial users of Bixo need, where Bixo is a piece in the data processing pipeline that needs to play well (ie doesn’t stall the entire process).
Anyway, I keep thinking that perhaps some of the reported problems with Nutch’s Fetcher2 are actually a sign that the fetcher is being appropriately polite, and the comparison with the old fetcher is flawed because that version had bugs where it would act impolitely.
I’ve definitely observed the slow long tail. As a matter of fact, I think I submitted some patches to deal with this a while back. If I recall correctly, I think I added something that had logic that at some point simply skipped a bunch of URLs in order to avoid getting stuck forever with a small number of slow hosts. Aha, yes, I think I had something like the minimal fetch rate setting. I’d keep track of the fetch rate and if I detected a slow host, I’d skip all its remaining URLs. I *think* that’s in Nutch 1.0…
As for Fetcher2 slowness, I don’t think it has to do with what you said. I think so because I see Roger Dunk used generate.max.per.host = 1 in https://issues.apache.org/jira/browse/NUTCH-721 .
I missed that Roger was setting things up to only have one URL per host…thanks.
Re fetch termination – terminating slow servers definitely helps. I’m dealing with a slightly different problem now, where the partner crawl has a big exponential decay in the URLs/domain graph. A few domains have 10K URLs, while most of the 50K unique domains have but one.
Since it’s a “partner” crawl, I’m adding a PartnerFetchPolicy that adaptively cranks down the crawl delay (to some minimum value), with the calculated value based on trying to crawl all of the ULRs in the target fetch duration. I’ll post on how well that works soon.
What about IScoreGenerator and IGroupingKeyGenerator in BIXO, it can be done “up to 1000 URLs from same domain in a single iteration” easily, so that we will have a tail at the very last iteration. However, we may have different crawl-depths on a single fetch step (one of domains have 2000 links from 100 pages, and another one 1000 links from 100 pages).
Another solution would be to avoid sleeping Threads at all. We need partitioner (Grouping Key) such a way that first N domains go to host 1, second N to host 2, etc; single page from a single domain for each iteration (could be 10 pages with Keep-Alive). Different architecture; we don’t have Queue of URLs for same domain in this case, and threads should not be worried about crawl-delay at all, it will be few minutes in practice… and…
…this shows the main problem of Cascading-type of architecture: resource usage is not optimized.
Imagine 100 Hadoop hosts sharing 100Mbps network connectivity. Fetch is I/O bound, 100% network wait, and 0.1% CPU time.
After all, we spend some time on Fetch, and some another time on Parse, and during Parse our network usage is 0 and CPU 100%, but during fetch network 100% and CPU 0%.
As I see from Google and even Yahoo robot logs they are constantly crawling my website without few-hours delay on parsing, indexing, preparing fetch list, and etc.
How to distribute load evenly?
I don’t understand your question about IScoreGenerator and IGroupingKeyGenerator – maybe post that to the bixo-dev mailing list with more details?
As to the second suggestion, trying to better use the Hadoop map-reduce support via a smarter partitioner is something that we spent time thinking about, but I don’t see how it could possibly work. I’ll write a blog post about that soon, to address the reasons why in greater depth.
Just blogged about this issue – see https://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/
Ken, at this point, with various semi-recent patches applied to Nutch to address some of these long-tail issues, would you say Bixo is still a better tool to use for vertical crawls? If so, what are the “remaining” aspects of Bixo that Nutch still misses to be equally good for such crawls?
I haven’t been closely tracking Nutch patches, though I see that there have been improvements to how the plugins can communicate information – which was one of the major headaches previously.
As to a comparison, I think the recent post by Stefano Cherchi on the Nutch list is a good example of where Bixo is much easier to use than Nutch. I had to deal with almost exactly the same use case he describes, where you have top-level pages that have links to the real content, but you have to crawl (and not index) these top pages first.
In Bixo I created a workflow that took top level URLs in, fetched them, used a modified parser to extract the links, then fed these into a second fetch pipe, which in turn fed the results into the parser.
So I wound up with one very simple (conceptually) Cascading workflow, which was very reliable…e.g. no error-prone editing of config files in between runs.
But it would be interesting to do a side-by-side comparison of Bixo & Nutch for some vertical crawl project. If you have something in mind, I’d be interested in giving that a try.
[…] https://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/ […]
[…]Performance problems with vertical/focused web crawling « Ken's Techno Tidbits[…]…