Robots.txt

Technical SEO

A text file at the root of a website that tells search engine crawlers which pages or sections of the site they are allowed or not allowed to crawl.

Definition

Robots.txt is a plain text file placed at the root directory of a website (example.com/robots.txt) that provides instructions to search engine crawlers about which parts of the site they should and should not access. It uses a simple syntax with User-agent (which crawler the rule applies to) and Disallow/Allow directives.

Robots.txt is part of the Robots Exclusion Protocol (standardized as RFC 9309), a convention that crawlers follow voluntarily. It is not a security mechanism: disallowing a page prevents compliant crawlers from fetching it, but the URL can still be indexed if other pages link to it.
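To actually keep a page out of search results, the standard approach is a noindex signal rather than a Disallow rule. Note that the page must remain crawlable for the signal to be seen. A sketch of the two common forms:

```txt
<!-- In the page's HTML head -->
<meta name="robots" content="noindex">

# Or as an HTTP response header
X-Robots-Tag: noindex
```

If the same page were also disallowed in robots.txt, crawlers could never fetch it and would never see the noindex directive.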

Why It Matters

Proper robots.txt configuration prevents search engines from wasting crawl budget on pages that should not be indexed (admin panels, duplicate content, staging environments). It also prevents embarrassing situations where internal or draft pages appear in search results.

With the rise of AI crawlers (GPTBot, ClaudeBot, PerplexityBot), robots.txt has gained a new role: controlling which AI systems can access your content for training purposes.
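A site that wants to opt out of these AI crawlers entirely could add per-bot groups like the following (the user-agent tokens are the ones these vendors publish; any bot may ignore robots.txt, since compliance is voluntary):

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```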

Examples

A basic robots.txt file:

```txt
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /

User-agent: GPTBot
Disallow: /proprietary-research/

Sitemap: https://example.com/sitemap.xml
```

This configuration allows all crawlers to access the site except the /admin/ and /staging/ directories, blocks GPTBot from the proprietary research pages, and points crawlers to the sitemap. One subtlety: under the Robots Exclusion Protocol, a crawler obeys only the most specific group that matches it, so GPTBot follows its own group and ignores the * rules. To keep GPTBot out of /admin/ and /staging/ as well, those Disallow lines must be repeated inside the GPTBot group.
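The behavior of these rules can be checked with Python's standard-library `urllib.robotparser` (the page paths below are hypothetical):

```python
import urllib.robotparser

# The example robots.txt from above, as a string.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /

User-agent: GPTBot
Disallow: /proprietary-research/

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Generic crawlers fall under the * group.
print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True

# GPTBot matches its own group, which only blocks /proprietary-research/ --
# the * group's /admin/ rule does not apply to it.
print(rp.can_fetch("GPTBot", "https://example.com/proprietary-research/paper"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/admin/login"))     # True
```

This makes the group-precedence behavior easy to verify before deploying a robots.txt change.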
