Robots.txt Generator
Create robots.txt files with User-agent, Allow, Disallow, and Sitemap directives for search engine crawlers.
Inputs: User-agent, Disallow paths, Allow paths, Sitemap URL. Output: a formatted robots.txt file.
About Robots.txt Generator
This robots.txt generator creates properly formatted robots.txt files using the Robots Exclusion Protocol. It supports User-agent, Allow, Disallow, and Sitemap directives, helping you control how search engine crawlers access and index your website content.
It is useful for blocking admin panels from search results, preventing duplicate content indexing, directing crawlers to your sitemap, managing crawl budget on large sites, protecting sensitive directories, and ensuring search engines focus on your most important content.
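As a rough sketch of what a generator like this does under the hood, the following Python function (an illustration, not the tool's actual code) assembles the four inputs into a robots.txt body:

```python
def build_robots_txt(user_agent="*", disallow=(), allow=(), sitemaps=()):
    """Assemble a robots.txt body from the generator's four inputs."""
    lines = [f"User-agent: {user_agent}"]
    # An empty Disallow value means "nothing is blocked" (allow-all default)
    if not disallow:
        lines.append("Disallow:")
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"

print(build_robots_txt(
    disallow=["/admin/", "/private/"],
    allow=["/public/"],
    sitemaps=["https://example.com/sitemap.xml"],
))
```

Calling it with no arguments produces the minimal allow-all file (`User-agent: *` plus an empty `Disallow:`).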
Robots.txt File Structure
A robots.txt file follows a simple line-based format:
Basic Structure:

    User-agent: [crawler-name]
    Disallow: [path]
    Allow: [path]
    Sitemap: [sitemap-url]

Example robots.txt:

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/
    Sitemap: https://example.com/sitemap.xml

Key Points:
- Each directive goes on its own line, in the format `Directive: value`
- Blank lines separate rule groups
- Lines starting with # are comments
- The file must live at the website root (/robots.txt)
- Paths are case-sensitive (/Admin/ ≠ /admin/)
- Values are written without quotes
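You can sanity-check a file like the example above with Python's standard-library `urllib.robotparser`. One caveat: this parser applies Allow/Disallow rules first-match rather than Google's longest-match, so results can differ when rules overlap.

```python
import urllib.robotparser

# The example robots.txt from this section, as a list of lines
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/admin/page"))   # blocked -> False
print(rp.can_fetch("*", "https://example.com/public/page"))  # allowed -> True
print(rp.site_maps())  # Python 3.8+: the listed Sitemap URLs
```

`RobotFileParser` can also fetch a live file with `set_url(...)` and `read()` instead of `parse(...)`.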
Common Crawler User-Agent Names
| Search Engine | User-Agent String | Notes |
|---|---|---|
| Google (General) | Googlebot | Main Google crawler |
| Google Images | Googlebot-Image | Indexes images for Google Images |
| Google News | Googlebot-News | Indexes news content |
| Google Video | Googlebot-Video | Indexes video content |
| Bing | bingbot | Microsoft Bing crawler |
| Yahoo | Slurp | Yahoo Search (powered by Bing) |
| DuckDuckGo | DuckDuckBot | DuckDuckGo search engine |
| Baidu | Baiduspider | Chinese search engine |
| Yandex | YandexBot | Russian search engine |
| Facebook | facebookexternalhit | Link preview generator |
| Twitter/X | Twitterbot | Twitter card preview |
| All Crawlers | * | Wildcard for all crawlers |
Common Robots.txt Examples
Example 1: Allow All Crawlers (Default)

    User-agent: *
    Disallow:
    Sitemap: https://example.com/sitemap.xml

Example 2: Block All Crawlers

    User-agent: *
    Disallow: /

Example 3: Block Specific Folders

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Disallow: /temp/
    Disallow: /cgi-bin/

Example 4: Block Specific Crawler

    User-agent: Googlebot
    Disallow: /admin/

    User-agent: *
    Disallow: /private/

Example 5: Allow Subfolder in Blocked Directory

    User-agent: *
    Disallow: /admin/
    Allow: /admin/public/

Example 6: Block Specific File Types

    User-agent: *
    Disallow: /*.pdf$
    Disallow: /*.doc$
    Disallow: /*.xls$

Example 7: Multiple Sitemaps

    User-agent: *
    Disallow:
    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-news.xml
    Sitemap: https://example.com/sitemap-images.xml
Path Pattern Matching
Path Matching Rules:

1. Prefix Match:

    Disallow: /admin

Blocks any path that starts with /admin: /admin, /admin/, /admin/page, and even /administrator. Robots.txt paths are prefix matches, not exact matches.

2. Directory Match (with trailing slash):

    Disallow: /admin/

Blocks /admin/, /admin/page, /admin/anything. Does NOT block /admin itself (no trailing slash).

3. Wildcard Match (*):

    Disallow: /*.pdf

Blocks any URL whose path contains .pdf.

    Disallow: /admin/*

Blocks everything under /admin/ (effectively the same as Disallow: /admin/).

4. End-of-URL Anchor ($):

    Disallow: /*.pdf$

Blocks URLs that END with .pdf.

    Disallow: /page.html$

Blocks only the exact URL /page.html.

5. Pattern Examples:
- /admin — matches any path beginning with /admin
- /admin/ — matches the directory and its contents
- /*.pdf$ — matches PDF files at the end of the URL
- /admin/*.html — matches HTML files under /admin/
- /*? — matches any URL containing a query string (? is a literal character in robots.txt, not a single-character wildcard)
- /page* — same as /page, since matching is prefix-based: matches /page, /page1, /pages, etc.

Important: Google and Bing support the * and $ wildcards, but other crawlers may not, so do not rely on wildcards for every user-agent.
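Google-style pattern matching can be sketched in a few lines of Python. This helper (an illustration, not an official parser) converts a robots.txt pattern into a regular expression that is anchored at the start of the URL path, with a trailing `$` anchoring the end:

```python
import re

def robots_pattern_to_regex(pattern):
    """Compile a robots.txt path pattern using Google-style * and $ semantics."""
    regex = re.escape(pattern).replace(r"\*", ".*")  # * matches any characters
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # a trailing $ anchors the end of the URL
    return re.compile(regex)  # re.match anchors at the start (prefix match)

print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))       # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf?dl=1")))  # False
print(bool(robots_pattern_to_regex("/admin/").match("/admin")))                  # False
```

This mirrors the rules above: `/admin/` does not match the bare `/admin`, and `/*.pdf$` only matches URLs that actually end in `.pdf`.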
Robots.txt Best Practices
| Practice | Recommendation |
|---|---|
| File location | Must be in root directory (https://domain.com/robots.txt) |
| File size | Google reads at most 500 KiB; content past that limit is ignored |
| Character encoding | Use UTF-8 encoding for international characters |
| Case sensitivity | Paths are case-sensitive (/Admin/ ≠ /admin/) |
| Comments | Use # for comments to document your rules |
| Testing | Always test in Google Search Console before deploying |
| Sitemap location | Include Sitemap directive to help crawlers find content |
| Crawl-delay | Use sparingly; Google ignores this directive |
Common Mistakes to Avoid
- Blocking CSS/JS files: Blocking resources Google needs to render pages can hurt rankings. Allow /css/ and /js/ directories.
- Using robots.txt for security: Robots.txt is not a security measure. Anyone can access blocked URLs. Use authentication for sensitive content.
- Blocking already-indexed pages: If a page is already indexed, blocking it with robots.txt won't remove it from search results. Use noindex or removal tools.
- Incorrect path syntax: Paths must start with / (forward slash). Disallow: admin is invalid; use Disallow: /admin/.
- Forgetting the sitemap: Always include your sitemap URL to help crawlers discover all your content efficiently.
- Over-blocking content: Be careful with wildcards. Disallow: /*.php could block important pages unintentionally.
- Not testing changes: Always test robots.txt changes in Google Search Console before deploying to production.
Testing and Validation
Testing Tools:

1. Google Search Console — robots.txt report
- Access via Search Console → Settings → robots.txt
- Shows how Googlebot fetched and parsed your file
- Shows which URLs are blocked
- Validates syntax and warns of issues

2. Bing Webmaster Tools
- Similar testing functionality for Bing
- Shows how bingbot interprets your rules

3. Manual Testing:
- Visit https://yourdomain.com/robots.txt
- Verify the file is accessible (HTTP 200)
- Check for typos and syntax errors
- Ensure paths match your URL structure

4. URL Inspection:
- Use the Search Console URL Inspection tool
- Check whether specific URLs are blocked
- View crawling status

Common Error Messages:
- "Unable to fetch" — file not accessible
- "Syntax error" — invalid directive format
- "Blocked by robots.txt" — URL blocked from crawling
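Alongside the hosted tools above, a quick offline syntax check is easy to script. This small Python linter (illustrative only, covering the directives discussed on this page) flags lines without a colon, unknown directives, and Allow/Disallow paths that do not start with a forward slash:

```python
VALID_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return a list of (line_number, message) warnings for a robots.txt body."""
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append((n, "missing ':' separator"))
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        if directive.lower() not in VALID_DIRECTIVES:
            problems.append((n, f"unknown directive '{directive}'"))
        elif directive.lower() in {"allow", "disallow"} and value and not value.startswith("/"):
            problems.append((n, f"path should start with '/': '{value}'"))
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/\nAllow: public/"))
```

On that sample input it reports two warnings: the misspelled `Disalow` directive and the `public/` path missing its leading slash.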
SEO Impact of Robots.txt
- Crawl budget optimization: Blocking low-value pages (admin, search results, filters) helps crawlers focus on important content.
- Duplicate content prevention: Block printer-friendly versions, session IDs, and filtered product listings to avoid duplicate content issues.
- Server load reduction: Blocking aggressive crawlers or specific paths can reduce server load during high-traffic periods.
- Staging site protection: Block search engines from development/staging sites to prevent premature indexing.
- Not a ranking factor: Proper robots.txt usage doesn't directly improve rankings, but it helps crawlers efficiently discover and index your content.
Frequently Asked Questions
- What is a robots.txt file and how does it work?
- A robots.txt file is a plain text file in your website's root directory that tells search engine crawlers which URLs they can or cannot access. It uses the Robots Exclusion Protocol with directives like User-agent, Allow, Disallow, and Sitemap. Crawlers read this file before accessing your site and respect the rules specified for their user-agent.
- What are the common robots.txt directives?
- Common directives include: User-agent (specifies which crawler the rules apply to), Disallow (blocks access to specific paths), Allow (permits access to specific paths within a blocked directory), Sitemap (provides the XML sitemap location), and Crawl-delay (requests a delay between requests; ignored by Google). The non-standard Host directive, once used by Yandex to declare a preferred domain, is now deprecated. Each directive appears on its own line.
- How do I block all crawlers from my site?
- To block all crawlers, use: User-agent: * followed by Disallow: /. This tells all crawlers they cannot access any part of your site. However, note that robots.txt is not a security mechanism - determined crawlers can ignore it, and blocked pages may still appear in search results if linked from other sites.
- How do I allow all crawlers to access my site?
- To allow all crawlers full access, use: User-agent: * followed by Disallow: (empty). An empty Disallow directive means nothing is blocked. Alternatively, you can omit the Disallow line entirely. You can also add Sitemap: https://example.com/sitemap.xml to help crawlers find your content.
- Does robots.txt prevent pages from appearing in Google?
- No, robots.txt only prevents crawling, not indexing. If a blocked page is linked from other websites, Google may still index its URL without content. To prevent indexing, use noindex meta tags or password protection. For sensitive data, use server-side authentication. Robots.txt is a request, not a security measure.
- What is the difference between Allow and Disallow?
- Disallow blocks crawlers from accessing specified paths. Allow permits access to specific paths that would otherwise be blocked by a broader Disallow rule. For example, Disallow: /admin/ blocks the entire admin folder, but Allow: /admin/public/ permits access to the public subfolder within admin.
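To make that precedence concrete, here is a minimal Python sketch of Google's documented resolution order for plain prefix rules (no wildcards): the longest matching pattern wins, and on a length tie the less restrictive Allow rule wins. It is a teaching aid, not a production parser.

```python
def is_allowed(rules, path):
    """rules: list of ('allow' | 'disallow', path_prefix) pairs."""
    matches = [(len(prefix), directive)
               for directive, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    longest = max(length for length, _ in matches)
    # The most specific (longest) rule wins; Allow wins a length tie
    return any(d == "allow" for length, d in matches if length == longest)

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed(rules, "/admin/public/page"))  # True: longer Allow rule wins
print(is_allowed(rules, "/admin/secret"))       # False: only Disallow matches
```

This is why `Allow: /admin/public/` can carve an exception out of `Disallow: /admin/`: the Allow pattern is longer, so it takes precedence for URLs under /admin/public/.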