How to stop robots

Ever wondered why so many clients request a file called robots.txt, which you don't have and never did have?

These clients are called robots - special automated clients which wander around the web looking for interesting resources.

Most robots are used to generate some kind of web index which is then used by a search engine to help locate information.

robots.txt provides a means to request that robots limit their activities at the site or, more often than not, leave the site alone.

When the first robots were developed, they had a bad reputation for sending hundreds of requests to each site, often resulting in the site being overloaded. Things have improved dramatically since then, thanks to Guidelines for Robot Writers, but even so, some robots may exhibit unfriendly behaviour which the webmaster isn't willing to tolerate.

Another reason some webmasters want to block access to robots results from the way in which the information collected by the robots is subsequently indexed. There are currently no well-used systems for annotating documents so that they can be indexed by wandering robots. Hence, the index writer will often resort to unsatisfactory algorithms to determine what gets indexed.

Typically, indexes are built around text which appears in document titles (<TITLE>) or main headings (<H1>), and more often than not, the words indexed are completely irrelevant or misleading as to the document's subject. The worst index is one based on every word in the document. This inevitably leads to search engines offering poor suggestions, which waste both the users' and the servers' valuable time.
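To see why title-and-heading indexing can mislead, here is a minimal sketch of such an indexer, written with Python's standard html.parser module. The class name HeadingIndexer, the function index_terms and the sample page are all hypothetical, made up purely for illustration; real robots will differ in their details.

```python
from html.parser import HTMLParser


class HeadingIndexer(HTMLParser):
    """Collects the words found inside <title> and <h1> elements only."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag we are currently inside, if it is indexed
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text outside <title>/<h1> is ignored entirely.
        if self._current:
            self.terms.extend(data.lower().split())


def index_terms(html):
    """Return the list of index words for one document (hypothetical helper)."""
    indexer = HeadingIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.terms


# Hypothetical page: the real subject only appears in the body text.
PAGE = (
    "<html><head><title>Welcome to my home page</title></head>"
    "<body><h1>Introduction</h1>"
    "<p>Notes on tuning diesel engines.</p></body></html>"
)

print(index_terms(PAGE))
# ['welcome', 'to', 'my', 'home', 'page', 'introduction']
```

None of the index words say anything about diesel engines, which is what the page is actually about.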

So if you decide to exclude robots completely, or just limit the areas in which they can roam, set up a robots.txt file, and refer to the robot exclusion documentation.
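As a rough illustration, the sketch below holds a small robots.txt in a Python string and checks it with the standard urllib.robotparser module. The robot names (BadBot, FriendlyBot) and the paths are assumptions invented for the example; consult the robot exclusion documentation for the authoritative format.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the robot name and the paths are made up.
# The first record shuts one robot out of the whole site; the second
# keeps every other robot out of two areas only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

print(parser.can_fetch("BadBot", "/index.html"))               # False
print(parser.can_fetch("FriendlyBot", "/index.html"))          # True
print(parser.can_fetch("FriendlyBot", "/private/notes.html"))  # False
```

The file itself would be saved as robots.txt at the top level of the server. Bear in mind that it is only a request: a well-behaved robot honours it, but nothing forces a badly behaved one to do so.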

Much better systems exist to both index your site and publicise its resources, e.g. ALIWEB, which uses site-defined index files.
