<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Performance problems with vertical/focused web crawling</title>
	<atom:link href="http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/feed/" rel="self" type="application/rss+xml" />
	<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/</link>
	<description></description>
	<lastBuildDate>Mon, 19 Dec 2011 02:06:37 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Nutch fetch performance - Nutch Tutorial</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-624</link>
		<dc:creator><![CDATA[Nutch fetch performance - Nutch Tutorial]]></dc:creator>
		<pubDate>Sat, 09 Oct 2010 07:19:53 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-624</guid>
		<description><![CDATA[[...] http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/ [...]]]></description>
		<content:encoded><![CDATA[<p>[...] <a href="http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/" rel="nofollow">http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-502</link>
		<dc:creator><![CDATA[kkrugler]]></dc:creator>
		<pubDate>Fri, 05 Feb 2010 23:22:09 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-502</guid>
		<description><![CDATA[Hi Otis,

I haven&#039;t been closely tracking Nutch patches, though I see that there have been improvements to how the plugins can communicate information - which was one of the major headaches previously.

As to a comparison, I think the recent post by Stefano Cherchi on the Nutch list is a good example of where Bixo is much easier to use than Nutch. I had to deal with almost exactly the same use case he describes, where you have top-level pages that have links to the real content, but you have to crawl (and not index) these top pages first.

In Bixo I created a workflow that took top-level URLs in, fetched them, used a modified parser to extract the links, then fed these into a second fetch pipe, which in turn fed the results into the parser.

So I wound up with one very simple (conceptually) Cascading workflow, which was very reliable...e.g. no error-prone editing of config files in between runs.

But it would be interesting to do a side-by-side comparison of Bixo &amp; Nutch for some vertical crawl project. If you have something in mind, I&#039;d be interested in giving that a try.

-- Ken]]></description>
		<content:encoded><![CDATA[<p>Hi Otis,</p>
<p>I haven&#8217;t been closely tracking Nutch patches, though I see that there have been improvements to how the plugins can communicate information &#8211; which was one of the major headaches previously.</p>
<p>As to a comparison, I think the recent post by Stefano Cherchi on the Nutch list is a good example of where Bixo is much easier to use than Nutch. I had to deal with almost exactly the same use case he describes, where you have top-level pages that have links to the real content, but you have to crawl (and not index) these top pages first.</p>
<p>In Bixo I created a workflow that took top-level URLs in, fetched them, used a modified parser to extract the links, then fed these into a second fetch pipe, which in turn fed the results into the parser.</p>
<p>So I wound up with one very simple (conceptually) Cascading workflow, which was very reliable&#8230;e.g. no error-prone editing of config files in between runs.</p>
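<p>The two-stage workflow described above can be sketched roughly as follows. This is plain Python, not the Bixo or Cascading API; LinkExtractor, crawl_two_stage, and the injected fetch and parse callables are hypothetical names used only for illustration.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute link targets from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they came from.
                    self.links.append(urljoin(self.base_url, value))


def crawl_two_stage(top_urls, fetch, parse):
    """Fetch top-level pages only to harvest links, then fetch and parse
    the linked pages. Top-level pages never appear in the result."""
    results = {}
    for top_url in top_urls:
        extractor = LinkExtractor(top_url)
        extractor.feed(fetch(top_url))          # stage 1: crawl, do not index
        for link in extractor.links:
            results[link] = parse(fetch(link))  # stage 2: the real content
    return results
```

<p>Passing fetch and parse in as arguments keeps the sketch free of network code, so the whole flow runs as one unit with no configuration edits between the two stages.</p>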
<p>But it would be interesting to do a side-by-side comparison of Bixo &amp; Nutch for some vertical crawl project. If you have something in mind, I&#8217;d be interested in giving that a try.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Otis Gospodnetic</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-501</link>
		<dc:creator><![CDATA[Otis Gospodnetic]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 17:37:30 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-501</guid>
		<description><![CDATA[Ken, at this point, with various semi-recent patches applied to Nutch to address some of these long-tail issues, would you say Bixo is still a better tool to use for vertical crawls?  If so, what are the &quot;remaining&quot; aspects of Bixo that Nutch still misses to be equally good for such crawls?]]></description>
		<content:encoded><![CDATA[<p>Ken, at this point, with various semi-recent patches applied to Nutch to address some of these long-tail issues, would you say Bixo is still a better tool to use for vertical crawls?  If so, what are the &#8220;remaining&#8221; aspects of Bixo that Nutch still misses to be equally good for such crawls?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-474</link>
		<dc:creator><![CDATA[kkrugler]]></dc:creator>
		<pubDate>Sat, 12 Dec 2009 21:53:06 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-474</guid>
		<description><![CDATA[Just blogged about this issue - see http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/]]></description>
		<content:encoded><![CDATA[<p>Just blogged about this issue &#8211; see <a href="http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/" rel="nofollow">http://ken-blog.krugler.org/2009/12/12/why-fetching-web-pages-doesnt-map-well-to-map-reduce/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-473</link>
		<dc:creator><![CDATA[kkrugler]]></dc:creator>
		<pubDate>Sat, 12 Dec 2009 21:38:24 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-473</guid>
		<description><![CDATA[Hi Fuad,

I don&#039;t understand your question about IScoreGenerator and IGroupingKeyGenerator - maybe post that to the bixo-dev mailing list with more details?

As to the second suggestion, trying to better use the Hadoop map-reduce support via a smarter partitioner is something that we spent time thinking about, but I don&#039;t see how it could possibly work. I&#039;ll write a blog post about that soon, to address the reasons why in greater depth.

-- Ken]]></description>
		<content:encoded><![CDATA[<p>Hi Fuad,</p>
<p>I don&#8217;t understand your question about IScoreGenerator and IGroupingKeyGenerator &#8211; maybe post that to the bixo-dev mailing list with more details?</p>
<p>As to the second suggestion, trying to better use the Hadoop map-reduce support via a smarter partitioner is something that we spent time thinking about, but I don&#8217;t see how it could possibly work. I&#8217;ll write a blog post about that soon, to address the reasons why in greater depth.</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fuad Efendi</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-472</link>
		<dc:creator><![CDATA[Fuad Efendi]]></dc:creator>
		<pubDate>Sat, 12 Dec 2009 14:39:05 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-472</guid>
		<description><![CDATA[Hi Ken,

What about IScoreGenerator and IGroupingKeyGenerator in Bixo? With them, &quot;up to 1000 URLs from the same domain in a single iteration&quot; can be done easily, so that we only have a tail at the very last iteration. However, we may have different crawl depths in a single fetch step (one domain has 2000 links from 100 pages, and another has 1000 links from 100 pages).

Another solution would be to avoid sleeping threads altogether. We need a partitioner (grouping key) such that the first N domains go to host 1, the second N to host 2, etc., with a single page from a single domain in each iteration (could be 10 pages with Keep-Alive). It&#039;s a different architecture: we don&#039;t have a queue of URLs for the same domain in this case, and threads don&#039;t need to worry about crawl-delay at all, since it will be a few minutes in practice... and... 

...this shows the main problem with a Cascading-type architecture: resource usage is not optimized.

Imagine 100 Hadoop hosts sharing 100Mbps network connectivity. Fetch is I/O bound, 100% network wait, and 0.1% CPU time.

After all, we spend some time on Fetch and some more time on Parse; during Parse our network usage is 0% and CPU is 100%, but during Fetch network is 100% and CPU is 0%.

As I see from Google and even Yahoo robot logs, they are constantly crawling my website, without a few hours of delay for parsing, indexing, preparing the fetch list, etc.

How to distribute load evenly?]]></description>
		<content:encoded><![CDATA[<p>Hi Ken,</p>
<p>What about IScoreGenerator and IGroupingKeyGenerator in Bixo? With them, &#8220;up to 1000 URLs from the same domain in a single iteration&#8221; can be done easily, so that we only have a tail at the very last iteration. However, we may have different crawl depths in a single fetch step (one domain has 2000 links from 100 pages, and another has 1000 links from 100 pages).</p>
<p>Another solution would be to avoid sleeping threads altogether. We need a partitioner (grouping key) such that the first N domains go to host 1, the second N to host 2, etc., with a single page from a single domain in each iteration (could be 10 pages with Keep-Alive). It&#8217;s a different architecture: we don&#8217;t have a queue of URLs for the same domain in this case, and threads don&#8217;t need to worry about crawl-delay at all, since it will be a few minutes in practice&#8230; and&#8230; </p>
<p>&#8230;this shows the main problem with a Cascading-type architecture: resource usage is not optimized.</p>
<p>Imagine 100 Hadoop hosts sharing 100Mbps network connectivity. Fetch is I/O bound, 100% network wait, and 0.1% CPU time.</p>
<p>After all, we spend some time on Fetch and some more time on Parse; during Parse our network usage is 0% and CPU is 100%, but during Fetch network is 100% and CPU is 0%.</p>
<p>As I see from Google and even Yahoo robot logs, they are constantly crawling my website, without a few hours of delay for parsing, indexing, preparing the fetch list, etc.</p>
<p>How to distribute load evenly?</p>
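<p>The grouping-key idea above, where every URL of a domain lands on the same worker, can be sketched as below. This is an illustration in plain Python rather than a Hadoop Partitioner or Bixo's IGroupingKeyGenerator; the function names are hypothetical, a stable hash stands in for the "first N domains to host 1" assignment, and the URL's hostname stands in for its domain.</p>

```python
import zlib
from urllib.parse import urlsplit


def host_for_url(url, num_hosts):
    """Map a URL to a worker index so that all URLs sharing a hostname
    are sent to the same worker."""
    domain = (urlsplit(url).hostname or "").lower()
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(domain.encode("utf-8")) % num_hosts


def partition(urls, num_hosts):
    """Split a URL list into one bucket per worker."""
    buckets = [[] for _ in range(num_hosts)]
    for url in urls:
        buckets[host_for_url(url, num_hosts)].append(url)
    return buckets
```

<p>Hashing spreads domains evenly, not URLs, so a handful of very large domains can still leave one worker with a long tail.</p>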
]]></content:encoded>
	</item>
	<item>
		<title>By: kkrugler</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-293</link>
		<dc:creator><![CDATA[kkrugler]]></dc:creator>
		<pubDate>Sun, 24 May 2009 07:53:59 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-293</guid>
		<description><![CDATA[Hi Otis,

I missed that Roger was setting things up to only have one URL per host...thanks.

Re fetch termination - terminating slow servers definitely helps. I&#039;m dealing with a slightly different problem now, where the partner crawl has a big exponential decay in the URLs/domain graph. A few domains have 10K URLs, while most of the 50K unique domains have but one.

Since it&#039;s a &quot;partner&quot; crawl, I&#039;m adding a PartnerFetchPolicy that adaptively cranks down the crawl delay (to some minimum value), with the calculated value based on trying to crawl all of the URLs in the target fetch duration. I&#039;ll post on how well that works soon.]]></description>
		<content:encoded><![CDATA[<p>Hi Otis,</p>
<p>I missed that Roger was setting things up to only have one URL per host&#8230;thanks.</p>
<p>Re fetch termination &#8211; terminating slow servers definitely helps. I&#8217;m dealing with a slightly different problem now, where the partner crawl has a big exponential decay in the URLs/domain graph. A few domains have 10K URLs, while most of the 50K unique domains have but one.</p>
<p>Since it&#8217;s a &#8220;partner&#8221; crawl, I&#8217;m adding a PartnerFetchPolicy that adaptively cranks down the crawl delay (to some minimum value), with the calculated value based on trying to crawl all of the URLs in the target fetch duration. I&#8217;ll post on how well that works soon.</p>
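<p>A minimal sketch of the delay calculation described above. The real PartnerFetchPolicy is not shown in this thread, so the function name, the 30-second default delay, and the 1-second floor are all assumptions made for illustration.</p>

```python
def adaptive_crawl_delay(url_count, target_duration_s,
                         default_delay_s=30.0, min_delay_s=1.0):
    """Return a per-domain crawl delay in seconds that tries to fit
    url_count fetches into target_duration_s.

    The delay is only ever lowered from the default, and never drops
    below the minimum. All default values here are assumed.
    """
    # Delay that would spread the domain's URLs across the whole target window.
    needed = target_duration_s / max(url_count, 1)
    return max(min_delay_s, min(default_delay_s, needed))
```

<p>With these assumed defaults, a domain with 10,000 URLs and a four-hour target gets a delay of 1.44 seconds, while a single-URL domain keeps the 30-second default.</p>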
]]></content:encoded>
	</item>
	<item>
		<title>By: Otis Gospodnetic</title>
		<link>http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/#comment-291</link>
		<dc:creator><![CDATA[Otis Gospodnetic]]></dc:creator>
		<pubDate>Sun, 24 May 2009 04:06:09 +0000</pubDate>
		<guid isPermaLink="false">http://ken-blog.krugler.org/?p=146#comment-291</guid>
		<description><![CDATA[I&#039;ve definitely observed the slow long tail.  As a matter of fact, I think I submitted some patches to deal with this a while back.  If I recall correctly, I think I added something that had logic that at some point simply skipped a bunch of URLs in order to avoid getting stuck forever with a small number of slow hosts.  Aha, yes, I think I had something like the minimal fetch rate setting.  I&#039;d keep track of the fetch rate and if I detected a slow host, I&#039;d skip all its remaining URLs.  I *think* that&#039;s in Nutch 1.0...

As for Fetcher2 slowness, I don&#039;t think it has to do with what you said.  I think so because I see Roger Dunk used generate.max.per.host = 1 in https://issues.apache.org/jira/browse/NUTCH-721 .]]></description>
		<content:encoded><![CDATA[<p>I&#8217;ve definitely observed the slow long tail.  As a matter of fact, I think I submitted some patches to deal with this a while back.  If I recall correctly, I think I added something that had logic that at some point simply skipped a bunch of URLs in order to avoid getting stuck forever with a small number of slow hosts.  Aha, yes, I think I had something like the minimal fetch rate setting.  I&#8217;d keep track of the fetch rate and if I detected a slow host, I&#8217;d skip all its remaining URLs.  I *think* that&#8217;s in Nutch 1.0&#8230;</p>
<p>As for Fetcher2 slowness, I don&#8217;t think it has to do with what you said.  I think so because I see Roger Dunk used generate.max.per.host = 1 in <a href="https://issues.apache.org/jira/browse/NUTCH-721" rel="nofollow">https://issues.apache.org/jira/browse/NUTCH-721</a> .</p>
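<p>The minimum-fetch-rate idea from the first paragraph could look something like the sketch below. It is not the Nutch patch itself; HostRateTracker, its parameters, and the grace period are hypothetical, and the clock is injectable so the logic can be exercised without waiting.</p>

```python
import time


class HostRateTracker:
    """Tracks pages fetched per host and flags hosts whose rate is too low."""

    def __init__(self, min_pages_per_s, grace_period_s, clock=time.monotonic):
        self.min_pages_per_s = min_pages_per_s
        self.grace_period_s = grace_period_s
        self.clock = clock
        self.first_seen = {}   # host -> time of the first fetch attempt
        self.fetched = {}      # host -> pages fetched so far
        self.skipped = set()   # hosts whose remaining URLs are dropped

    def start(self, host):
        """Note the first time a host is touched."""
        self.first_seen.setdefault(host, self.clock())
        self.fetched.setdefault(host, 0)

    def record_fetch(self, host):
        """Count one completed fetch for the host."""
        self.start(host)
        self.fetched[host] += 1

    def should_skip(self, host):
        """Return True once a host has fallen below the minimum rate."""
        if host in self.skipped:
            return True
        if host not in self.first_seen:
            return False
        elapsed = self.clock() - self.first_seen[host]
        # Give every host a grace period before judging its rate.
        if self.grace_period_s >= elapsed:
            return False
        if self.min_pages_per_s > self.fetched[host] / elapsed:
            self.skipped.add(host)
            return True
        return False
```

<p>A fetch loop would call should_skip(host) before each request and drop the host's remaining URLs once it returns True, trading a few missed pages for a bounded fetch time.</p>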
]]></content:encoded>
	</item>
</channel>
</rss>

