I like grep. Grep lets you search within files, using regular expressions. Usually, you can even use wildcards to specify which files grep should search within. This means that when you ask grep to find something, you can give it just about as much information as you know, reducing the chances that it will spend a long time looking in the wrong place(s) or that it will return the wrong result(s).
But why can’t I do this on the web? Well, maybe it is possible, but I haven’t yet found a web site that lets me use wildcards to specify which URLs to search and regular expression syntax to specify what to search for. (If you know of such a site, please post it here.)
In the absence of such a site, I’d like to propose creating one. So I’ve registered www.greptheweb.com and I’d like the community’s help to build a site to put there.
The site should, I think:
- Have a front page with a form that has two fields:
  - Where to search. Here users can enter URLs with Perl regex syntax.
  - What to search for. Here users can specify strings they’re looking for, also using Perl’s regex syntax.
- Users should perhaps be assisted in entering their queries by means of a RegexBuddy-style helper, maybe implemented with AJAX techniques.
- Have a help or FAQ page.
- Have results pages that are dynamically generated with GET requests. These would be returned in response to a form submission, but would also be generated in response to a direct request, e.g. from a bookmark.
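To make the two-field idea concrete, here is a minimal sketch in Python. It is only an illustration, not a design: the `PAGES` dictionary stands in for a real index, the function names, the `/search` path and the `where`/`what` parameter names are all made up for the example, and Python's `re` syntax is close to Perl's but not identical.

```python
import re
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical stand-in for an index: URL -> page text.
PAGES = {
    "http://www.python.org/": "Python is a programming language.",
    "http://www.python.org/doc/": "Documentation for Python 2.5 and the re module.",
    "http://www.example.com/": "Nothing about snakes here.",
}

def grep_the_web(pages, where, what):
    """Return (url, line) pairs: the URL must match `where`,
    and the line of page text must match `what`."""
    url_re = re.compile(where)
    text_re = re.compile(what)
    hits = []
    for url, text in sorted(pages.items()):
        if not url_re.search(url):
            continue  # wrong place: skip without scanning the text
        for line in text.splitlines():
            if text_re.search(line):
                hits.append((url, line))
    return hits

def results_url(where, what, base="http://www.greptheweb.com/search"):
    """Build a bookmarkable GET URL for a query (path and parameter
    names are assumptions made for this sketch)."""
    return base + "?" + urlencode({"where": where, "what": what})

if __name__ == "__main__":
    for url, line in grep_the_web(PAGES, r"python\.org/doc", r"re\s+module"):
        print(url, "|", line)
    print(results_url(r"python\.org/doc", r"re\s+module"))
```

Because the whole query lives in the URL's query string, the same results page can be reached from the form or straight from a bookmark.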
The back end of the site is the hardest part, of course. Web search engines are difficult to implement well. Scaling is an issue. I’ve never built a search engine; I’ve never written a grep-like program. Still, I have some ideas about how the back end should work.
- It could just use the Alexa Web Search’s “grep the web” service, templating the results in HTML and returning them to the user; or
- It could be built from scratch. If the latter, then:
  - It should, to begin with, index pages in a simplistic fashion. That is, it should include a robot that can find pages with straightforward URLs like http://www.python.org. Eventually the robot should be made sophisticated enough to find all publicly accessible web pages.
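As a rough sketch of what the simplest version of that robot might do with one fetched page, the Python below pulls links out of some HTML and keeps only the "straightforward" ones. What counts as straightforward is my assumption for the example (an http or https URL with no query string), and the class and function names are invented here. Fetching, robots.txt handling and politeness delays are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))

def is_straightforward(url):
    """Assumed definition: http(s), has a host, no query string."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc) and not parts.query

def straightforward_links(base_url, html):
    """Return the page's straightforward links, de-duplicated, in order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    found = []
    for link in collector.links:
        link = link.split("#", 1)[0]  # drop any #fragment
        if is_straightforward(link) and link not in found:
            found.append(link)
    return found

if __name__ == "__main__":
    sample = '<a href="/doc/">Docs</a> <a href="search?q=x">Search</a>'
    print(straightforward_links("http://www.python.org/", sample))
```

A real robot would feed each returned URL back into a queue of pages to fetch, which is where most of the difficulty begins.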
Oh, I like this. I’ll keep the idea in the back of my head and see what germinates.
Their back-end probably wouldn’t be very useful, but I immediately thought of the http://www.yubnub.org interface when I read this post.
Sweet, I’d been vaguely looking for something like YubNub (i.e. a “URL command line for the web OS”, as Jon Aquino puts it: http://jonaquino.blogspot.com/2005/06/yubnub-my-entry-for-rails-day-24-hour.html) for a while. cURL and wget are great, but you need access to a net-facing Linux account to use them. Having access to command-line utilities for the web from the web would be really handy in lots of situations. Grep is just one case, but since it’s the one that’s been bugging me most, it’s the one I’m tackling.
As for progress on greptheweb, I’d hoped to get a Trac site up to begin making inroads visible. Having a wiki would be handy for jotting down ideas for getting on with, and making code web-browsable is easier on anyone interested in looking at it than forcing them to check it out over svn. If I still can’t get Trac running on Dreamhost on the next attempt, I think I’ll either fork out for an account with WebFaction, VPSville or some such, or I’ll set up code & wiki hosting via Google Code or similar.
I’ve parsed out a list of URLs from the Open Directory Project’s massive RDF index, and made some calculations about how large a minimally-greppable local copy of all the pages from those URLs would be: ~500GB. In other words, I’m not going to be able to fit the index on my Dreamhost account if I want to have any breathing room to speak of (and I do want breathing room, for photos and other bits and pieces). I picked up a cheap HP server recently, and put a couple of big SATA drives in it, so my plan is to build the index on it, assuming I can learn enough sysadmin skills to set it up with Debian or FreeBSD and crawl the web from behind my router. I’m a little concerned that my ISP will think I’m trying to overwhelm their DNS server if I’m sending out 3600 requests an hour (one per second), so this could be another hurdle. I might have to implement a round-robin system to query multiple DNS servers in sequence, or something.
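The round-robin idea could look something like the Python sketch below. It only plans the rotation and the pacing; it does not send any DNS queries (as far as I know the standard library resolves through the system resolver, so pointing a lookup at a specific server would need a third-party DNS library). The resolver addresses are placeholders from a documentation-only range, and the `schedule` function is a name I made up for this example.

```python
import itertools

REQUESTS_PER_HOUR = 3600                              # figure from above
SECONDS_BETWEEN_REQUESTS = 3600 / REQUESTS_PER_HOUR   # 1.0 second

# Placeholder addresses (192.0.2.0/24 is reserved for documentation),
# not real DNS servers.
RESOLVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

def schedule(hostnames, resolvers=RESOLVERS, interval=SECONDS_BETWEEN_REQUESTS):
    """Yield (send_at_seconds, resolver, hostname) tuples, handing each
    lookup to the next resolver in turn and spacing lookups `interval`
    seconds apart."""
    rotation = itertools.cycle(resolvers)
    for i, host in enumerate(hostnames):
        yield (i * interval, next(rotation), host)

if __name__ == "__main__":
    hosts = ["a.example", "b.example", "c.example", "d.example"]
    for send_at, resolver, host in schedule(hosts):
        print(f"t={send_at:>4.1f}s  ask {resolver}  for {host}")
```

With three resolvers in the rotation, each one would see about 1200 lookups an hour instead of 3600. Caching answers per hostname would cut that further, since many URLs share a host.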
Then there’s the question of how to store, and grep, a million or more web pages. So far, I’m thinking of using Nutch to build the index, because that’s exactly what it’s designed for. But Nutch doesn’t have the capability to query that index with regexps once it’s built. Also, regexp queries are expensive. So I think a hybrid approach might work:
I’m thinking about a tool that would also work on uploaded simple tabular files. The purpose would be to introduce journalists, citizen journalists, and others to the power of searching and filtering for multiple things with grep.
Would love to know the progress of this project.
Greg Elin http://twitter.com/gregelin
Hi Greg, thanks for the interest. I haven’t had a chance to do much at all with it since my last comment above. I’ll update this thread next time I have a chance to work on it.