<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Grep the web from the web</title>
	<atom:link href="http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Fri, 13 Aug 2010 21:49:52 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: sampablokuper</title>
		<link>http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/comment-page-1/#comment-3958</link>
		<dc:creator>sampablokuper</dc:creator>
		<pubDate>Sun, 07 Jun 2009 17:54:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.sampablokuper.com/blog/?p=9#comment-3958</guid>
		<description>&lt;p&gt;Hi Greg, thanks for the interest. I haven&#039;t had a chance to do much at all with it since my last comment above. I&#039;ll update this thread next time I have a chance to work on it.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Hi Greg, thanks for the interest. I haven&#8217;t had a chance to do much at all with it since my last comment above. I&#8217;ll update this thread next time I have a chance to work on it.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Elin</title>
		<link>http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/comment-page-1/#comment-2707</link>
		<dc:creator>Greg Elin</dc:creator>
		<pubDate>Fri, 22 May 2009 15:36:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.sampablokuper.com/blog/?p=9#comment-2707</guid>
		<description>&lt;p&gt;I&#039;m thinking about a tool that would work on uploaded simple tabular files as well. The purpose would be to introduce journalists, citizen journalists, and others to the power of searching and filtering for multiple things with grep.&lt;/p&gt;

&lt;p&gt;Would love to know the progress of this project.&lt;/p&gt;

&lt;p&gt;Greg Elin
http://twitter.com/gregelin&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I&#8217;m thinking about a tool that would work on uploaded simple tabular files as well. The purpose would be to introduce journalists, citizen journalists, and others to the power of searching and filtering for multiple things with grep.</p>

<p>Would love to know the progress of this project.</p>

<p>Greg Elin
<a href="http://twitter.com/gregelin" rel="nofollow">http://twitter.com/gregelin</a></p>]]></content:encoded>
	</item>
	<item>
		<title>By: sampablokuper</title>
		<link>http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/comment-page-1/#comment-5</link>
		<dc:creator>sampablokuper</dc:creator>
		<pubDate>Thu, 05 Jun 2008 01:37:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.sampablokuper.com/blog/?p=9#comment-5</guid>
		<description>&lt;p&gt;Sweet, I&#039;d been vaguely looking for something like YubNub (i.e. a &quot;URL command line for the web OS&quot;, &lt;a href=&quot;http://jonaquino.blogspot.com/2005/06/yubnub-my-entry-for-rails-day-24-hour.html&quot; rel=&quot;nofollow&quot;&gt;as Jon Aquino puts it&lt;/a&gt;) for a while. cURL and wget are great, but you need access to a net-facing Linux account to use them. Having access to command-line utilities for the web &lt;i&gt;from&lt;/i&gt; the web would be really handy in lots of situations. Grep is just one case, but since it&#039;s the one that&#039;s been bugging me most, it&#039;s the one I&#039;m tackling.&lt;/p&gt;

&lt;p&gt;As for progress on greptheweb, I&#039;d hoped to get a Trac site up to begin making inroads visible. Having a wiki would be handy for jotting down ideas to get on with, and making code web-browsable is easier on anyone interested in looking at it than forcing them to check it out over svn. If I still can&#039;t get Trac running on Dreamhost on the next attempt, I think I&#039;ll either fork out for an account with WebFaction, VPSville or some such, or I&#039;ll set up code &amp; wiki hosting via Google Code or similar.&lt;/p&gt;

&lt;p&gt;I&#039;ve parsed out a list of URLs from the Open Directory Project&#039;s massive RDF index, and made some calculations about how large a minimally-greppable local copy of all the pages from those URLs would be: ~500GB. In other words, I&#039;m not going to be able to fit the index on my Dreamhost account if I want to have any breathing room to speak of (and I do want breathing room, for photos and other bits and pieces). I picked up a cheap HP server recently, and put a couple of big SATA drives in it, so my plan is to build the index on this, assuming I can learn enough sysadmin skills to set it up with Debian or FreeBSD and index the web from behind my router. I&#039;m a little concerned that my ISP will think I&#039;m trying to overwhelm their DNS server if I&#039;m sending out 3600 requests an hour, so this could be another hurdle. I might have to implement a round-robin system to query multiple DNS servers in sequence, or something.&lt;/p&gt;

&lt;p&gt;Then there&#039;s the question of how to store, and grep, a million or more web pages. So far, I&#039;m thinking of using Nutch to build the index, because that&#039;s exactly what it&#039;s designed for. But Nutch doesn&#039;t have the capability to query that index with regexps once it&#039;s built. Also, regexp queries are expensive. So I think a hybrid approach might work:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;First, parse the submitted regexp to see if it contains any mandatory strings (in practice, many probably will). If it does, use Nutch&#039;s built-in Lucene engine to search for entries with those strings (i.e. to filter out those that don&#039;t).&lt;/li&gt;
&lt;li&gt;Then, run the regexp match on the remaining entries. How? Er, at this point it looks like I might have to write a Nutch plug-in...&lt;/li&gt;&lt;/ul&gt;
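
&lt;p&gt;A minimal Python sketch of that two-stage idea, with plain substring checks standing in for the Lucene filter and an in-memory dict standing in for the Nutch index. The names mandatory_literals and grep_pages are hypothetical, and the literal extraction is deliberately conservative: any pattern containing alternation, groups, character classes or counted repeats yields no required strings, so every page simply goes to the regexp stage.&lt;/p&gt;

```python
import re


def mandatory_literals(pattern, min_len=3):
    """Return literal substrings that every match of `pattern` must contain.

    Hypothetical helper. Patterns with alternation, groups, classes or
    counted repeats return [] so that no page is wrongly filtered out.
    """
    if any(ch in pattern for ch in "|()[]{}"):
        return []
    # Split into escaped pairs (such as \. or \d) and single characters.
    tokens = re.findall(r"\\.|.", pattern, flags=re.DOTALL)
    runs, current = [], ""
    for i, tok in enumerate(tokens):
        following = tokens[i + 1] if len(tokens) > i + 1 else ""
        if len(tok) == 2 and not tok[1].isalnum():
            literal = tok[1]      # escaped punctuation, e.g. \. means "."
        elif len(tok) == 1 and tok not in ".^$*+?\\":
            literal = tok         # ordinary character
        else:
            literal = None        # metacharacter or class such as \d
        # A character followed by * or ? is optional, so it is not mandatory.
        if literal is not None and following not in ("*", "?"):
            current += literal
        # Any metacharacter or quantifier ends the current literal run.
        if literal is None or following in ("*", "?", "+"):
            runs.append(current)
            current = ""
    runs.append(current)
    return [run for run in runs if len(run) >= min_len]


def grep_pages(pattern, pages):
    """Return URLs whose text matches `pattern`.

    `pages` maps URL to page text; it is an assumed stand-in for the index.
    """
    required = mandatory_literals(pattern)
    regex = re.compile(pattern)
    hits = []
    for url, text in pages.items():
        # Stage 1: cheap substring prefilter (the role Lucene would play).
        if not all(lit in text for lit in required):
            continue
        # Stage 2: full regexp match on the surviving pages only.
        if regex.search(text):
            hits.append(url)
    return hits
```

&lt;p&gt;For example, grep_pages(r&quot;colou?r&quot;, pages) would first keep only pages containing &quot;colo&quot; and then run the full regexp on those.&lt;/p&gt;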
</description>
		<content:encoded><![CDATA[<p>Sweet, I&#8217;d been vaguely looking for something like YubNub (i.e. a &#8220;URL command line for the web OS&#8221;, <a href="http://jonaquino.blogspot.com/2005/06/yubnub-my-entry-for-rails-day-24-hour.html" rel="nofollow">as Jon Aquino puts it</a>) for a while. cURL and wget are great, but you need access to a net-facing Linux account to use them. Having access to command-line utilities for the web <i>from</i> the web would be really handy in lots of situations. Grep is just one case, but since it&#8217;s the one that&#8217;s been bugging me most, it&#8217;s the one I&#8217;m tackling.</p>

<p>As for progress on greptheweb, I&#8217;d hoped to get a Trac site up to begin making inroads visible. Having a wiki would be handy for jotting down ideas to get on with, and making code web-browsable is easier on anyone interested in looking at it than forcing them to check it out over svn. If I still can&#8217;t get Trac running on Dreamhost on the next attempt, I think I&#8217;ll either fork out for an account with WebFaction, VPSville or some such, or I&#8217;ll set up code &amp; wiki hosting via Google Code or similar.</p>

<p>I&#8217;ve parsed out a list of URLs from the Open Directory Project&#8217;s massive RDF index, and made some calculations about how large a minimally-greppable local copy of all the pages from those URLs would be: ~500GB. In other words, I&#8217;m not going to be able to fit the index on my Dreamhost account if I want to have any breathing room to speak of (and I do want breathing room, for photos and other bits and pieces). I picked up a cheap HP server recently, and put a couple of big SATA drives in it, so my plan is to build the index on this, assuming I can learn enough sysadmin skills to set it up with Debian or FreeBSD and index the web from behind my router. I&#8217;m a little concerned that my ISP will think I&#8217;m trying to overwhelm their DNS server if I&#8217;m sending out 3600 requests an hour, so this could be another hurdle. I might have to implement a round-robin system to query multiple DNS servers in sequence, or something.</p>

<p>Then there&#8217;s the question of how to store, and grep, a million or more web pages. So far, I&#8217;m thinking of using Nutch to build the index, because that&#8217;s exactly what it&#8217;s designed for. But Nutch doesn&#8217;t have the capability to query that index with regexps once it&#8217;s built. Also, regexp queries are expensive. So I think a hybrid approach might work:</p>

<ul><li>First, parse the submitted regexp to see if it contains any mandatory strings (in practice, many probably will). If it does, use Nutch&#8217;s built-in Lucene engine to search for entries with those strings (i.e. to filter out those that don&#8217;t).</li>
<li>Then, run the regexp match on the remaining entries. How? Er, at this point it looks like I might have to write a Nutch plug-in&#8230;</li></ul>]]></content:encoded>
	</item>
	<item>
		<title>By: Mel</title>
		<link>http://www.sampablokuper.com/2008/05/11/grep-the-web-from-the-web/comment-page-1/#comment-4</link>
		<dc:creator>Mel</dc:creator>
		<pubDate>Thu, 05 Jun 2008 01:03:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.sampablokuper.com/blog/?p=9#comment-4</guid>
		<description>&lt;p&gt;Oh, I like this. I&#039;ll keep the idea in the back of my head and see what germinates.&lt;/p&gt;

&lt;p&gt;Their back-end probably wouldn&#039;t be very useful, but I immediately thought of the http://www.yubnub.org interface when I read this post.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Oh, I like this. I&#8217;ll keep the idea in the back of my head and see what germinates.</p>

<p>Their back-end probably wouldn&#8217;t be very useful, but I immediately thought of the <a href="http://www.yubnub.org" rel="nofollow">http://www.yubnub.org</a> interface when I read this post.</p>]]></content:encoded>
	</item>
</channel>
</rss>
