Webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's database by using a robots meta tag.
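As a sketch, a robots meta tag placed in a page's `<head>` might look like this (the `noindex, nofollow` values shown here are one common choice, not the only one):

```html
<!-- Asks search engines not to index this page
     and not to follow the links on it -->
<meta name="robots" content="noindex, nofollow">
```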
When a search engine visits a site, the robots.txt file in the root directory is the first file fetched. The crawler parses this file and will crawl only pages that are not disallowed. Because a crawler may keep a cached copy of robots.txt, it may on occasion crawl pages a webmaster does not wish crawled.
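The parsing step above can be sketched with Python's standard-library `urllib.robotparser`, which applies the same allow/disallow logic a crawler would. The robots.txt content and `example.com` URLs here are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# URLs under a disallowed prefix are skipped; everything else is crawlable
print(parser.can_fetch("*", "https://example.com/cart/checkout"))  # False
print(parser.can_fetch("*", "https://example.com/about"))          # True
```

A real crawler would fetch robots.txt over HTTP (e.g. via `parser.set_url(...)` and `parser.read()`) rather than parse an inline string, which is also where the caching behavior mentioned above comes in.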
Pages typically prevented from being crawled include login-specific pages such as shopping carts and user-specific content such as the results of internal searches.
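A robots.txt excluding the kinds of pages just mentioned might read as follows (the path names are illustrative, not a standard layout):

```
User-agent: *
Disallow: /cart/
Disallow: /account/
Disallow: /search
```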