Friday, January 5, 2007

Blocking Robots

Webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, an individual page can be explicitly excluded from a search engine's index by using a robots meta tag.
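
For example, placing this standard tag in a page's HTML head asks engines not to index the page; the optional nofollow value additionally asks the spider not to follow the page's links:

    <meta name="robots" content="noindex, nofollow">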

When a search engine visits a site, the robots.txt file in the root folder is the first file it requests. The file is then parsed, and only pages that are not disallowed will be crawled. Because a crawler may keep a cached copy of this file, it may occasionally crawl pages a webmaster does not wish crawled.
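
As a rough sketch of what a well-behaved crawler does, Python's standard urllib.robotparser module will fetch and parse a site's robots.txt and answer per-URL questions (the example.com URLs here are placeholders, not a real site):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt (placeholder URL)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # A polite crawler checks each URL against the rules before fetching it
    if rp.can_fetch("*", "http://www.example.com/cart/checkout"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")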

Pages typically prevented from being crawled include login-protected pages such as shopping carts, and user-specific content such as results from internal site searches, as in the sketch below.
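
A robots.txt covering those cases might look like this (the directory names are hypothetical; Disallow rules match by URL prefix):

    User-agent: *
    Disallow: /cart/
    Disallow: /search/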