While working on Bixo, I spent a fair amount of time trying to figure out how to avoid the multi-threaded complexity and memory-usage issues of the FetcherBuffer class that I wound up writing.
The FetcherBuffer takes care of setting up queues of URLs to be politely fetched, with one queue for each unique <IP address>+<crawl delay> combination. Then a queue of these queues is managed by the FetcherQueueMgr, which works with a thread pool to provide groups of URLs to be fetched by an available thread, when enough time has gone by since the last request to be considered polite.
But this approach means that in the reducer phase of a map-reduce job you have to create these queues, and then wait in the completion phase of the operation until all of them have been processed. Running multiple threads creates complexity and memory issues due to native memory stack space requirements, and having in-memory queues of URLs creates additional memory pressure.
So why can’t we just use Hadoop’s map-reduce support to handle all of this for us?
The key problem is that MR works well when each operation on a key/value pair is independent of any other key/value, and there are no external resource constraints.
But neither of those is true, especially during polite fetching.
For example, let’s say you implemented a mapper that created groups of 10 URLs, where each group was for the same server. You could easily process these groups in a reducer operation. This approach has two major problems, however.
First, you can’t control the interval between when groups for the same server would be processed. So you can wind up hitting a server to fetch URLs from a second group before enough time has expired to be considered polite, or worse yet you could have multiple threads hitting the same server at the same time.
Second, the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of ccores (servers * cores/server). So on a 10 server cluster w/dual cores, you’d have 20 threads active. But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware & bandwidth. In Bixo, for example, a typical configuration is 300 threads/reducer.
Much of web crawling/mining maps well to a Hadoop map-reduce architecture, but fetching web pages unfortunately is a square peg in a round hole.
Regarding the “the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of ccores (servers * cores/server). So on a 10 server cluster w/dual cores, you’d have 20 threads active.” piece:
Do number from this example match what you’ve observed with Nutch? I don’t have an example of Nutch log handy, but it does include the number of active threads, so I’m wondering if in your experience it really matches the above theory(?)?
My comment about parallelism being limited by the number of reducers is based on a design goal of trying to use straight MR, without the complexity of additional threading support.
Since you don’t get sufficient parallelism via just the number of reducers, you have to deal with the added complexity of a multi-threaded reducer. Which is what Nutch and Bixo both do – but it’s not pretty 🙂
Regarding the “But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware & bandwidth.” Isn’t the waiting time parallelized by MR? so it ends up faster?
Yes, MR will parallelize this wait time. But you can’t run many MR tasks/server, as each one runs in its own child JVM. So to make efficient use of your hardware, you need to achieve greater parallelism, which means layering a multi-threaded model on top of Hadoop.