<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Why fetching web pages doesn&#8217;t map well to map-reduce</title>
	<atom:link href="http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/</link>
	<description></description>
	<lastBuildDate>Fri, 20 Aug 2010 02:46:48 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/#comment-560</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Mon, 09 Aug 2010 21:11:56 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=197#comment-560</guid>
		<description>Hi Xiaomeng,

Yes, MR will parallelize this wait time. But you can&#039;t run many MR tasks/server, as each one runs in its own child JVM. So to make efficient use of your hardware, you need to achieve greater parallelism, which means layering a multi-threaded model on top of Hadoop.

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Xiaomeng,</p>
<p>Yes, MR will parallelize this wait time. But you can&#8217;t run many MR tasks/server, as each one runs in its own child JVM. So to make efficient use of your hardware, you need to achieve greater parallelism, which means layering a multi-threaded model on top of Hadoop.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: xiaomeng wan</title>
		<link>http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/#comment-559</link>
		<dc:creator>xiaomeng wan</dc:creator>
		<pubDate>Mon, 09 Aug 2010 20:54:30 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=197#comment-559</guid>
		<description>Regarding the &quot;But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware &amp; bandwidth.&quot; Isn&#039;t the waiting time parallelized by MR, so that it ends up faster?</description>
		<content:encoded><![CDATA[<p>Regarding the &#8220;But since most of the time during a fetch is spent waiting for the server to respond, you’re getting very low utilization of your available hardware &amp; bandwidth.&#8221; Isn&#8217;t the waiting time parallelized by MR, so that it ends up faster?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/#comment-500</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Mon, 01 Feb 2010 17:25:54 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=197#comment-500</guid>
		<description>Hi Otis,

My comment about parallelism being limited by the number of reducers is based on a design goal of trying to use straight MR, without the complexity of additional threading support.

Since you don&#039;t get sufficient parallelism via just the number of reducers, you have to deal with the added complexity of a multi-threaded reducer. Which is what Nutch and Bixo both do - but it&#039;s not pretty :)</description>
		<content:encoded><![CDATA[<p>Hi Otis,</p>
<p>My comment about parallelism being limited by the number of reducers is based on a design goal of trying to use straight MR, without the complexity of additional threading support.</p>
<p>Since you don&#8217;t get sufficient parallelism via just the number of reducers, you have to deal with the added complexity of a multi-threaded reducer. Which is what Nutch and Bixo both do &#8211; but it&#8217;s not pretty <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sematext</title>
		<link>http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/#comment-499</link>
		<dc:creator>sematext</dc:creator>
		<pubDate>Mon, 01 Feb 2010 17:04:08 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=197#comment-499</guid>
		<description>Regarding the &quot;the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of cores (servers * cores/server). So on a 10 server cluster w/dual cores, you’d have 20 threads active.&quot; piece:

Do the numbers from this example match what you&#039;ve observed with Nutch? I don&#039;t have an example Nutch log handy, but it does include the number of active threads, so I&#039;m wondering whether, in your experience, it really matches the above theory.</description>
		<content:encoded><![CDATA[<p>Regarding the &#8220;the maximum amount of parallelization would be equal to the number of reducers, which typically is something close to the number of cores (servers * cores/server). So on a 10 server cluster w/dual cores, you’d have 20 threads active.&#8221; piece:</p>
<p>Do the numbers from this example match what you&#8217;ve observed with Nutch? I don&#8217;t have an example Nutch log handy, but it does include the number of active threads, so I&#8217;m wondering whether, in your experience, it really matches the above theory.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
