Robots.txt Guide — How to Control Search Engine Crawling

What Robots.txt Does

The robots.txt file is a plain text file that sits in your website's root directory and tells search engine crawlers which pages or sections they should or should not visit. When Googlebot, Bingbot, or any other well-behaved crawler arrives at your site, it checks for a robots.txt file before requesting other URLs, then typically caches the result for a while rather than re-fetching it for every page. The instructions in this file guide which URLs the crawler will request and which it will skip.

It is important to understand that robots.txt is a request, not a security measure. Well-behaved crawlers like Googlebot respect these directives, but malicious bots and scrapers ignore them entirely. Never rely on robots.txt to hide sensitive information; use proper authentication or server-level access controls instead. If the goal is only to keep a public page out of search results, use the noindex meta tag, and leave that page crawlable so search engines can actually see the tag.

Basic Syntax and Rules

The robots.txt format is simple. Each block starts with a User-agent line specifying which crawler the rules apply to, followed by Allow and Disallow directives. A User-agent of * means the rules apply to all crawlers. Disallow: /admin/ tells crawlers not to visit any URL starting with /admin/. Allow: /admin/public/ creates an exception within a disallowed directory.
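As a minimal sketch of these directives in action, the following uses Python's standard-library urllib.robotparser to parse a small robots.txt and query it. The paths, the example.com host, and the crawler name ExampleBot are illustrative assumptions. The Allow line is listed first because this particular parser applies the first matching rule in file order.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; the paths are assumptions, not recommendations.
ROBOTS_TXT = """\
# Rules for every crawler
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def can_crawl(path, agent="ExampleBot"):
    """Return True if the hypothetical crawler `agent` may fetch `path`."""
    return parser.can_fetch(agent, "https://example.com" + path)

print(can_crawl("/admin/settings"))     # blocked by Disallow: /admin/
print(can_crawl("/admin/public/help"))  # permitted by the Allow exception
print(can_crawl("/blog/first-post"))    # no rule matches, so crawling is permitted
```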

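For crawlers that follow the Robots Exclusion Protocol standard (RFC 9309), including Googlebot, the order of rules within a block does not decide the outcome: the most specific matching rule, meaning the one with the longest path, takes precedence, and when an Allow and a Disallow rule are equally specific, Allow wins. Some simpler parsers instead apply the first matching rule from top to bottom, so listing exceptions before the broader rule keeps behavior consistent across both. An empty Disallow line means nothing is disallowed, so the crawler can access everything. A single Disallow: / blocks the entire site. Comments start with the hash symbol and are ignored by crawlers. Each directive must be on its own line.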
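The "more specific rule wins" behavior can be sketched as longest-prefix matching, which is how RFC 9309 describes rule selection. The function below is a simplified illustration, not a full parser: it assumes the rules for one crawler have already been collected into a list, and it ignores the * and $ wildcards that major search engines also support. The name is_allowed and the rule tuples are hypothetical.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, path_prefix) tuples, for example
    ("disallow", "/admin/"). Only plain prefix matching is implemented.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific and wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed(rules, "/admin/users"))        # only the Disallow rule matches
print(is_allowed(rules, "/admin/public/help"))  # the longer Allow rule wins
```

Because the decision depends on prefix length rather than position, reversing the order of the two rules gives the same answers.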

Common Robots.txt Patterns

WordPress sites typically disallow /wp-admin/ while allowing /wp-admin/admin-ajax.php (which some themes and plugins need for frontend functionality). Ecommerce sites often disallow search result pages, cart pages, and checkout pages, since these add no SEO value and waste crawl budget. Development or staging environments should disallow everything to keep crawlers away from test content, ideally in combination with password protection, because a disallowed URL can still end up indexed if other sites link to it.
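The WordPress and staging patterns described above might look like the two files in this sketch, which again checks them with Python's urllib.robotparser. The helper is_crawlable and the crawler name ExampleBot are hypothetical; the Allow line comes first because this parser honors the first matching rule.

```python
from urllib.robotparser import RobotFileParser

# Typical WordPress pattern: block the admin area, keep the AJAX endpoint open.
WORDPRESS_ROBOTS = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

# Staging pattern: ask every crawler to stay away from the whole site.
STAGING_ROBOTS = """\
User-agent: *
Disallow: /
"""

def is_crawlable(robots_txt, path, agent="ExampleBot"):
    """Parse `robots_txt` and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

print(is_crawlable(WORDPRESS_ROBOTS, "/wp-admin/options.php"))     # admin area is blocked
print(is_crawlable(WORDPRESS_ROBOTS, "/wp-admin/admin-ajax.php"))  # AJAX endpoint stays open
print(is_crawlable(STAGING_ROBOTS, "/any/page"))                   # everything is blocked
```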

For most content websites, a minimal robots.txt works best: allow everything except administrative areas and internal search results. Over-blocking with robots.txt is usually more harmful than under-blocking. If you accidentally disallow important pages, search engines can no longer read their content, so those pages tend to lose their rankings and may drop out of search results or appear without a description. Use our Robots.txt Checker at safewebtools.com to verify your file is syntactically correct and not accidentally blocking important pages.

Sitemap Declaration in Robots.txt

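The robots.txt file is also the standard place to declare your XML sitemap location. Adding a Sitemap line with the full, absolute URL tells search engines where to find your sitemap. The line is independent of any User-agent block, so it can go anywhere in the file, although the bottom is a common convention. This is especially useful because crawlers re-fetch robots.txt regularly (Google, for example, generally caches it for up to 24 hours), so they pick up a new sitemap URL soon after you change it.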

You can declare multiple sitemaps — one per line. If you have separate sitemaps for posts, pages, and images, list all of them. Some CMS platforms like WordPress with Yoast SEO generate a sitemap index file that references multiple sub-sitemaps. In that case, you only need to declare the index file in robots.txt.
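To illustrate multiple Sitemap lines, this sketch parses a sample file and lists the declared sitemaps using the site_maps() method of Python's urllib.robotparser, which is available in Python 3.8 and later. The example.com URLs and file names are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: one rule block followed by two sitemap declarations.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap lines.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```

With a sitemap index file, the same file would carry a single Sitemap line pointing at the index.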

Crawl Budget and Why It Matters

Search engines allocate a certain amount of resources to crawl each website — this is your crawl budget. For small sites with a few hundred pages, crawl budget is rarely an issue. But for large sites with thousands or millions of pages, how you spend your crawl budget directly affects how quickly new content gets indexed and how frequently existing content gets refreshed.

Blocking unnecessary pages with robots.txt helps search engines focus their crawling on your most important content. Pages that waste crawl budget include duplicate content from URL parameters, paginated archive pages beyond the first few, internal search result pages, and tag or author archive pages with thin content. By directing crawlers away from these low-value pages, you help them discover and index your valuable content faster.

Testing and Monitoring Your Robots.txt

Google Search Console includes a robots.txt report that shows which robots.txt files Google found for your site, when they were last crawled, and any warnings or errors; it replaced the older robots.txt tester. To check whether a specific URL is blocked, use the URL Inspection tool in Search Console. Use these tools whenever you make changes to verify you have not accidentally blocked important pages. Our Robots.txt Checker tool provides similar validation along with syntax checking and common mistake detection.
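A check like this can also be scripted so it runs every time the file changes. The sketch below is one possible approach, assuming you maintain a list of URLs that must stay crawlable; the function find_blocked, the sample rules, and the example.com URLs are hypothetical. It uses Python's urllib.robotparser, which does plain prefix matching and does not understand the * and $ wildcards, so results for wildcard rules can differ from Google's.

```python
from urllib.robotparser import RobotFileParser

def find_blocked(robots_txt, urls, agent="Googlebot"):
    """Return the subset of `urls` that `agent` may not fetch under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

# Hypothetical file with an over-broad rule: "/blog" also matches every blog post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /blog
"""

MUST_BE_CRAWLABLE = [
    "https://example.com/",
    "https://example.com/blog/launch-announcement",
    "https://example.com/pricing",
]

blocked = find_blocked(ROBOTS_TXT, MUST_BE_CRAWLABLE)
for url in blocked:
    print("Blocked but should be crawlable:", url)
```

To test a live site instead of a string, the same parser can load the file over HTTP with set_url() followed by read().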

Monitor your robots.txt file for unauthorized changes. If someone gains access to your site and modifies robots.txt to disallow all crawling, your entire site can disappear from search results within days. Include robots.txt in your file integrity monitoring and check it periodically to ensure it matches your intended configuration.
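One simple way to detect such changes is to compare a hash of the live file against a hash of the version you approved. This is a minimal sketch of the idea rather than a full monitoring setup; the function names and sample contents are hypothetical, and in practice the current text would come from fetching your own /robots.txt on a schedule.

```python
import hashlib

def fingerprint(robots_txt):
    """SHA-256 hex digest of the file contents, with line endings normalized."""
    normalized = robots_txt.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(current_txt, approved_txt):
    """Return True if the current file differs from the approved version."""
    return fingerprint(current_txt) != fingerprint(approved_txt)

APPROVED = "User-agent: *\nDisallow: /admin/\n"
TAMPERED = "User-agent: *\nDisallow: /\n"  # blocks the entire site

if has_changed(TAMPERED, APPROVED):
    print("robots.txt differs from the approved version; review it.")
```

Normalizing line endings before hashing avoids false alarms when the file is merely re-saved with Windows-style line breaks.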