Freshness & Comprehensiveness
The two most important factors affecting Web search databases are freshness and comprehensiveness. By 1994, there was already a tremendous amount of Web content, and its growth to today's several billion Web documents has been spectacular to watch. To achieve comprehensiveness, as much of the Web as possible must be recorded. Portions of this content change rather frequently. To keep a database fresh, new and newly changed Web documents must be recorded in a timely manner, and out-of-date and missing documents must be discarded.
We will now discuss the two models of Web search technology: directory-based search services and spider-based search services.
Directory-based Web search
"Jerry's Guide to the Internet" was a Web site created by Jerry Yang and David Filo while they were graduate students at Stanford University. By 1994, they had created the most comprehensive collection of human-categorized Web sites available. This directory-based search service was then renamed Yahoo!. It proved very successful at cataloging major Web sites during the Web's formative years and attracted a wide and loyal audience.
A directory-based system categorizes Web sites and presents them according to those categories. Providing a directory-based search service requires that editors review and categorize Web sites and manually enter site abstractions into a central database. Site abstractions are displayed to users with two primary elements: a site Title and a Description. Titles and Descriptions either originate from the editors themselves (upon reviewing the site) or are adapted from site owner submissions. A site owner will typically use an HTML form or email to suggest a site for review and to propose an ideal Title and Description. This process is covered in more detail later, in Directory Enhancement (7).
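As a rough illustration of the data model just described, the sketch below stores editor-approved listings under category names and matches queries against Titles and Descriptions only. The class names, the category path, and the example URL are all hypothetical; no real directory's schema is implied.

```python
from dataclasses import dataclass, field


@dataclass
class SiteEntry:
    """One reviewed listing: the two elements shown to users, plus the URL."""
    url: str
    title: str
    description: str


@dataclass
class Directory:
    """Hypothetical central database: category name -> listings filed under it."""
    categories: dict = field(default_factory=dict)

    def add(self, category, entry):
        # An editor files a reviewed site under a category.
        self.categories.setdefault(category, []).append(entry)

    def browse(self, category):
        # Users see only the listings filed under the chosen category.
        return self.categories.get(category, [])

    def search(self, query):
        # Matching covers titles and descriptions only, never page content.
        q = query.lower()
        return [
            entry
            for entries in self.categories.values()
            for entry in entries
            if q in entry.title.lower() or q in entry.description.lower()
        ]


directory = Directory()
directory.add(
    "Recreation/Travel",  # hypothetical category path
    SiteEntry("http://travel.example/", "Example Travel", "Trip planning guides."),
)
matches = directory.search("travel")
```

Because only the abstraction is searchable, a query for a word that appears on the site itself but not in its Title or Description finds nothing, which is one reason directory matching is narrower than full-document search.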
Directory-based search services were quickly viewed as impractical for pursuing a more comprehensive Web search. Even with an army of editors 25,000+ strong, the largest directory, with nearly 3 million Web sites, does not amount to much in a Web of billions of documents. Directories also prove inadequate compared with spider-based search services when it comes to freshness. Spiders use machines to automate the recording process, so popular and frequently changing Web documents are re-recorded promptly and efficiently.
Spider-based Web search
The same underlying protocols that let a Web user move from one document to another by following hypertext links also let machines simulate that process, traversing or "crawling" the Web (as in a spider-robot crawling the Internet's Web). One advantage of using machines is that they can automatically record the documents they visit into a central database. By recording a Web document's content (minus unimportant code), a spider-based service can deliver a broader Web search than is possible with directory searches. Robots, crawlers, spiders, indexers and bots are all terms applied to the machines that automate the recording of Web pages. A search service that employs spiders has several advantages over its directory-based rivals.
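The traversal described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: it assumes a breadth-first visiting order, and it walks a hypothetical in-memory set of pages (`PAGES`) instead of fetching over the network. The function and class names are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML. A real spider would fetch over HTTP.
PAGES = {
    "http://a.example/": '<html><body>Home <a href="http://b.example/">B</a> '
                         '<a href="http://c.example/">C</a></body></html>',
    "http://b.example/": '<p>Page B links back <a href="http://a.example/">home</a></p>',
    "http://c.example/": '<p>Page C <a href="http://missing.example/">dead link</a></p>',
}


class PageParser(HTMLParser):
    """Collects link targets and visible text, dropping the markup itself."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(seed, pages):
    """Follow links breadth-first from a seed URL; return URL -> recorded text."""
    database = {}
    queue = deque([seed])
    seen = {seed}  # every URL ever queued, so no document is visited twice
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # missing document: nothing to record
        parser = PageParser()
        parser.feed(html)
        parser.close()
        database[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


recorded = crawl("http://a.example/", PAGES)
```

Starting from a single seed, the spider reaches every page that is linked from a page it has already visited, and the dead link is skipped without being recorded.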
Spider-based searches run against a set of millions or billions of Web documents, enabling broader query matching than directory-based systems can offer. Freshness is also easier to manage, because spidering machines can regularly and automatically refresh popular and frequently changing content. Spidering the whole Web is not widely viewed as an important goal, despite what any spider-based search engine may claim it plans to do. Spider-based search engines simply need to crawl the most popular and highest-quality content. Much of what gets crawled is thrown out as duplication or as low-grade content without much value (think of a site disclaimer or privacy policy), and, for all intents and purposes, comprehensiveness is still generally achieved. Directory search services do retain one benefit over spider-based search services, however: an entirely pre-qualified set of results inherently assures a high level of quality in Web search.
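The refresh-and-discard cycle can be sketched as follows. This is an illustrative assumption about how such a pass might look, not a description of any real engine: it detects only exact duplicates (after normalizing case and whitespace) using a content hash, and it makes no attempt to judge low-grade content. The names `fingerprint` and `refresh` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash of lowercased, whitespace-normalized text; equal hashes mean duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def refresh(database, fetched):
    """Bring `database` (URL -> text) in line with the latest crawl.

    `fetched` maps URL -> text, or None when the document could not be retrieved.
    Returns the list of (action, url) pairs taken, for illustration.
    """
    actions = []
    # Fingerprints of stored documents that are not being re-checked in this pass.
    known = {fingerprint(t) for u, t in database.items() if u not in fetched}
    for url, text in fetched.items():
        if text is None:
            # Missing document: discard it if it was recorded earlier.
            if database.pop(url, None) is not None:
                actions.append(("discarded", url))
            continue
        fp = fingerprint(text)
        if fp in known:
            # Same content already recorded under another URL: throw it out.
            database.pop(url, None)
            actions.append(("duplicate", url))
            continue
        known.add(fp)
        if url not in database:
            actions.append(("added", url))
        elif database[url] != text:
            actions.append(("updated", url))
        database[url] = text
    return actions


database = {"u1": "old news", "u2": "gone page", "u3": "Privacy policy"}
fetched = {"u1": "fresh news", "u2": None, "u4": "privacy   POLICY", "u5": "brand new"}
actions = refresh(database, fetched)
```

A production system would also need a policy for how often each document is re-fetched, with popular and frequently changing pages visited more often; that scheduling is left out here.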