Robots exclusion standard

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive websites, or by webmasters to proofread source code. The standard is unrelated to, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

History

The invention of "robots.txt" is attributed to Martijn Koster, who proposed it while working for Nexor in 1994. "robots.txt" was then popularized with the advent of AltaVista and other popular search engines in the following years.

About the standard

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see the examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file does not exist, web robots assume that the site owner wishes to give no specific instructions.
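For instance, a robots.txt file that asks every robot to stay out of the entire site needs only the two directives defined by the standard, User-agent and Disallow (the "*" matches any robot):

  # Keep all robots out of the whole site
  User-agent: *
  Disallow: /

Leaving the Disallow value empty means the opposite, granting all robots complete access:

  # Allow all robots to visit everything
  User-agent: *
  Disallow: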

A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling the site. A site owner might want this, for example, out of a preference for privacy from search engine results, a belief that the content of the selected directories is misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data.
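A more selective file names individual robots and directories. In the sketch below, all robots are asked to avoid two directories, while one particular crawler is excluded from the whole site (the name "BadBot" is a placeholder, not a real robot; records are separated by blank lines):

  # All robots: stay out of these two directories
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/

  # One specific robot: stay out of the entire site
  User-agent: BadBot
  Disallow: /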

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules in example.com's file would not apply to a.example.com.
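A cooperating crawler therefore needs to fetch and evaluate a separate robots.txt file for each host it visits. Below is a minimal sketch of such a check using Python's standard urllib.robotparser module; the hostnames, paths, and the crawler name "MyCrawler" are illustrative placeholders:

  from urllib.parse import urlsplit
  from urllib.robotparser import RobotFileParser

  def can_fetch(user_agent, url):
      # robots.txt lives at the root of the URL's own host, so
      # example.com and a.example.com are consulted independently.
      parts = urlsplit(url)
      robots = RobotFileParser()
      robots.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
      robots.read()  # a missing file is treated as "no restrictions"
      return robots.can_fetch(user_agent, url)

  # Each subdomain gets its own lookup:
  print(can_fetch("MyCrawler", "http://example.com/tmp/page.html"))
  print(can_fetch("MyCrawler", "http://a.example.com/tmp/page.html"))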

Disadvantages

The protocol is purely advisory. It relies on the cooperation of the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some website administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available, and its contents are easily checked by anyone with a web browser.
