Friday, January 5, 2007

Getting into Search Engines' Databases

Today's major search engines generally do not require site owners to submit their pages, because their crawlers can find new pages by following links from other sites.

Google and Yahoo do offer submission programs, such as Google Sitemaps, to which an XML feed can be submitted. In most cases, though, a single link from a site that is already indexed is enough to get the search engines to visit a new site and begin spidering its contents. It can take a few days or even weeks after such a link appears for all the main search engine spiders to begin indexing a new site, and there is usually not much that can be done to speed up this process.
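
For readers who want to see what such an XML feed might look like, here is a minimal sketch in Python. It assumes the sitemaps.org 0.9 format (a `urlset` element containing one `url`/`loc` pair per page); the function name `build_sitemap` and the example.com addresses are placeholders made up for illustration, not anything a search engine prescribes.

```python
import xml.etree.ElementTree as ET

# Namespace assumed here: the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # <loc> holds the full address of the page.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs; replace with the pages of the real site.
sitemap_xml = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/about.html",
])
print(sitemap_xml)
```

The resulting text would typically be saved as a file on the site and then submitted through the search engine's own submission program.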

Once a search engine finds a new site, it uses a crawler program to retrieve and index the pages on the site. Pages can generally be found only when they are linked to with visible hyperlinks, although some search engines (Google, for example) are starting to read links created by Flash.

Search engine crawlers may look at a number of different factors when crawling a site, and many pages from a site may not be indexed until they gain more PageRank, links or traffic. The distance of a page from the root directory of a site may also be a factor in whether or not it gets crawled, as may other importance metrics. Cho et al. described some standards for deciding which pages a crawler visits and sends for inclusion in a search engine's index.
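
To make the idea of distance from the root more concrete, here is a rough sketch in Python of a crawler that follows links breadth-first and stops at a fixed depth. This is only an illustration under stated assumptions: distance is measured in link hops from the home page, the site is a small made-up dictionary rather than real fetched pages, and `crawl_order` and `max_depth` are hypothetical names. It is not how any particular search engine actually decides what to crawl.

```python
from collections import deque


def crawl_order(link_graph, root, max_depth):
    """Walk links breadth-first from root, ignoring pages deeper than max_depth."""
    depth = {root: 0}          # link hops from the root for each page found
    queue = deque([root])
    order = []                 # pages in the order they would be visited
    while queue:
        page = queue.popleft()
        order.append(page)
        if depth[page] == max_depth:
            continue           # do not follow links out of the deepest allowed level
        for target in link_graph.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return order, depth


# Made-up site structure: each page maps to the pages it links to.
site = {
    "/": ["/products", "/about"],
    "/products": ["/products/widgets"],
    "/products/widgets": ["/products/widgets/blue"],
    "/about": [],
}

order, depth = crawl_order(site, "/", max_depth=2)
print(order)
```

With a depth limit of 2, the page `/products/widgets/blue` is never reached, which mirrors how pages buried far from the home page may go uncrawled.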

A few search engines, such as Yahoo!, operate paid submission services that guarantee crawling for either a set fee or a cost-per-click (CPC) charge. Such programs usually guarantee inclusion in the database, but do not guarantee a specific ranking within the search results.
