Help LookSmart Crawl the Web

LookSmart is taking a new approach to discovering web content, offering a free downloadable screensaver program that also crawls the web when your computer is idle.

The program is Grub, the distributed crawling service that LookSmart bought in January for $1.4 million.

Most crawlers are centralized, run from each search engine’s data centers. Grub, on the other hand, runs from the computers of anyone who has downloaded and installed the Grub client. LookSmart plans to use the information gathered by Grub crawlers to supplement the centralized crawls run by its Wisenut search engine.

“Fundamentally, the first problem we’re trying to solve with our acquisition of Grub is that we know about many more documents than we can actually retrieve and analyze right now,” said Peter Adams, chief technology officer of LookSmart. “We know about over 10 billion URLs right now, and we see that trend growing in terms of web pages that are being added.”

Most search engines crawl many more documents than they actually index. Even after culling duplicate pages, spam, and other inappropriate content, search engines have a hard time keeping pace with the constantly changing nature of the web.

This causes problems with the freshness of search engine indexes. While all of the major search engines update at least a portion of their indexes on a daily basis, most take anywhere from two weeks to a couple of months to completely refresh their databases.

Crawling more frequently, while technically possible, has its downsides, including greater costs and greater bandwidth consumption. Grub’s distributed approach to crawling can alleviate some of these downsides, according to Adams.

“Our first objective is to build a community of distributed web crawlers that will allow us to crawl all of the web documents every day,” Adams said. “Not necessarily to index them all, but to assemble a database of information about them — what’s new, what’s dead, what’s changed.”

The Grub crawler visits a list of essentially random URLs sent down from a central server. It retrieves each page and analyzes it, creating a “fingerprint” of the document, a unique code that describes it. Each time a page is crawled, Grub compares the new fingerprint to the old one. If they differ, that signals the page has changed.

“Instead of crawling and sending everything back, we only have the crawlers send back changed information,” said Adams. This intermediate analysis of a page is impossible for centralized crawlers, since they must retrieve a page and store it in the search engine’s database before any analysis can begin.
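
To make the mechanism concrete, here is a minimal Python sketch of fingerprint-based change detection. It reflects my own assumptions, not Grub’s actual algorithm or wire protocol: I use a SHA-256 hash as the “fingerprint” and a plain dictionary as the store of previously seen fingerprints.

    import hashlib
    import urllib.request

    # Fingerprints already reported to the central server, keyed by URL.
    # A real client would persist these between crawl sessions.
    known_fingerprints = {}

    def fingerprint(content):
        """Reduce a document to a short, unique code describing it."""
        return hashlib.sha256(content).hexdigest()

    def crawl(url):
        """Fetch a URL; return a report only if there is news about it."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                content = resp.read()
        except OSError:
            return {"url": url, "status": "unreachable"}

        new_fp = fingerprint(content)
        old_fp = known_fingerprints.get(url)
        if new_fp == old_fp:
            return None  # unchanged: send nothing back

        known_fingerprints[url] = new_fp
        status = "new" if old_fp is None else "changed"
        return {"url": url, "status": status, "fingerprint": new_fp}

    # Only new, changed, or dead URLs generate traffic to the server.
    for url in ["http://www.grub.org/", "http://example.com/"]:
        report = crawl(url)
        if report:
            print(report)  # a real client would transmit this instead

Because unchanged pages produce no report at all, the traffic back to the server scales with the rate of change of the web rather than with its size.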

LookSmart believes that this distributed approach to crawling will be vital to coping with the growth of the Internet, and assuring that search engines continue to produce relevant results.

“If you look back over the past ten years of search engines, beyond five years ago what you’re really seeing is a few large servers working on a smallish index,” said Andre Stechert, Grub’s director of technology.

“A little while ago, there was something called cluster computing that came along, and Google essentially capitalized on this in a big bad way. They took existing information retrieval algorithms and put them on this cheap computing model, which fundamentally changed search,” said Stechert.

Whereas Google uses clusters of thousands of computers, Stechert envisions yet another leap forward in search engine technology: distributed “grid” programs like Grub, hosted not on thousands of computers but on millions.

“Google asked the question, ‘What happens when you have 10,000 computers?’ We’re asking, ‘What happens when you have a million?’” said Stechert. “This is going to yield another revolution in the quality of search results.”

The Grub client is easy to download and install. You have full control over its behavior — when it runs, how much bandwidth it consumes, and so on. In my tests, it crawled dozens of URLs in minutes over my cable modem connection without interfering with any of the other applications running on my computer.

It’s fascinating to watch the crawling process. The standard Grub interface shows you two graphs, displaying your bandwidth “history” and the number of URLs crawled per minute. Other statistics display information about the current crawl — pages that have changed, remain unchanged, are unreachable, and so on.

The screensaver is a visualization that graphically displays the crawling process. You can also switch to a view that scrolls the list of URLs as they’re being crawled.

You have no control over what’s crawled, with one exception that I’ll discuss in a moment. Nonetheless, it’s fascinating to see the display of URLs from all over the world, most of them unfamiliar. It reminds me of the early days of the web, when random web page generators were popular.

If you own or operate your own web site, Grub will allow you to run a “local” crawl of your site every night. This is a great way to ensure that all of the content on your site gets crawled. For large sites, it will also cut down on bandwidth consumption, since Grub compresses all the data it sends back to its servers by a factor of up to 20:1.
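
The 20:1 figure is LookSmart’s; how close real crawl data gets to it depends on how repetitive the content is. As a rough illustration, this Python snippet compresses a block of highly redundant HTML, the kind of markup a crawler ships back in bulk, using the standard zlib library:

    import zlib

    # Crawl output is highly repetitive, which is the best case for
    # DEFLATE-style compression.
    page_data = (b"<tr><td>http://example.com/page</td>"
                 b"<td>unchanged</td></tr>\n") * 500

    compressed = zlib.compress(page_data, 9)  # 9 = maximum compression

    print("original:  ", len(page_data), "bytes")
    print("compressed:", len(compressed), "bytes")
    print("ratio:      %.1f:1" % (len(page_data) / len(compressed)))

On this artificial input the ratio far exceeds 20:1; ordinary HTML with less repetition compresses less dramatically, but the win on a large nightly crawl is still substantial.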

Why should you help LookSmart index the web? The altruistic reason is that it will help them broaden their coverage of the web, and potentially improve the relevance of search results. If Grub catches on, it’s likely to spur similar efforts by other search engines.

Grub also keeps stats for each user. You can see how much your client has crawled, and compare your “ranking” with other Grub users.

But the best reason, at least to me, is that watching a crawler in action is fascinating. It allows you to directly observe a process that’s normally hidden away in the black boxes we call search engines. Bottom line: it’s a heck of a lot of fun.

Grub
http://www.grub.org

Grub Frequently Asked Questions
http://www.grub.org/html/help.php?op=main-faq

Patent Wars!
The Search Engine Report, Oct. 6, 1997
https://www.searchenginewatch.com/sereport/97/10-patent.html

LookSmart isn’t the first search engine to use distributed computing. All the way back in 1997, Infoseek (whose technology is now owned by Disney) was issued a patent for “distributed searching,” a sort of federated meta-search process. Google is also experimenting with distributed computing with its Google Compute project.

Google’s New High Protein Diet
SearchDay, Mar. 25, 2002
https://www.searchenginewatch.com/searchday/02/sd0325-googlecom.html
Google is harnessing the collective computing power of its users to help model complex proteins, a project that could lead to the development of cures for Alzheimer’s, cancer, AIDS and other diseases.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.

Online search engines news
Being Tops with Your Users and the Search Engines (Part 1)
High Rankings Apr 3 2003 2:26PM GMT
Online portals news
Building a Service Portal
line56 Apr 3 2003 2:21PM GMT
Online search engines news
Web users flock to BBC in search of news
Netimperative Apr 3 2003 12:31PM GMT
AltaVista… Google… Microsoft? Where does the future of web search lie?
Silicon.com Apr 3 2003 11:03AM GMT
Microsoft Tries to Boost MSN Web Searches
SiliconValley.com Apr 3 2003 9:41AM GMT
Microsoft says it’ll take on Google
MSNBC Apr 3 2003 6:02AM GMT
Analyst: Jeeves to sell enterprise unit
CNET Apr 3 2003 0:46AM GMT
Microsoft Covets Google’s Niche
Wired News Apr 2 2003 11:37PM GMT
Online portals news
Overture Expanding Globally With MSN Korea
SiliconValley.Internet.com Apr 2 2003 2:44PM GMT
Internet: international news
China will log ‘keyboard clicking’ to rein in the internet
Silicon.com Apr 2 2003 11:01AM GMT
Online portals news
New Portal to International Courts and Tribunals
BeSpacific Apr 2 2003 6:34AM GMT
Online search engines news
Al-Jazeera most sought-after in Internet searches
Yahoo Apr 2 2003 0:24AM GMT
Got a Question? Google It
Readers Digest Apr 1 2003 9:20PM GMT
powered by Moreover.com
