Robots.txt Generator
Create robots.txt files with User-agent, Allow, Disallow, and Sitemap directives for search engine crawlers.
Inputs: User-agent, Disallow paths, Allow paths, Sitemap URL. Output: a formatted robots.txt file.
About Robots.txt Generator
This robots.txt generator creates properly formatted robots.txt files using the Robots Exclusion Protocol. It supports User-agent, Allow, Disallow, and Sitemap directives, helping you control how search engine crawlers access and index your website content.
It is useful for blocking admin panels from search results, preventing duplicate content indexing, directing crawlers to your sitemap, managing crawl budget on large sites, protecting sensitive directories, and ensuring search engines focus on your most important content.
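As a rough sketch of what a generator like this does under the hood, the following Python function (an illustration, not the tool's actual code) assembles the four inputs into a robots.txt body:

```python
def build_robots_txt(user_agent="*", disallow=(), allow=(), sitemaps=()):
    """Assemble a robots.txt body from the generator's four inputs."""
    lines = [f"User-agent: {user_agent}"]
    # An empty Disallow value means "nothing is blocked" (allow-all default)
    if not disallow:
        lines.append("Disallow:")
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"

print(build_robots_txt(
    disallow=["/admin/", "/private/"],
    allow=["/public/"],
    sitemaps=["https://example.com/sitemap.xml"],
))
```

Calling it with no arguments produces the minimal allow-all file (`User-agent: *` plus an empty `Disallow:`).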
Robots.txt File Structure
A robots.txt file follows a simple line-based format:
Basic Structure:

    User-agent: [crawler-name]
    Disallow: [path]
    Allow: [path]
    Sitemap: [sitemap-url]

Example robots.txt:

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/
    Sitemap: https://example.com/sitemap.xml

Key Points:
- Each directive goes on its own line, in the format `Directive: value`
- Blank lines separate rule groups
- Lines starting with # are comments
- The file must live at the website root (/robots.txt)
- Paths are case-sensitive (/Admin/ ≠ /admin/)
- Values are written without quotes
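You can sanity-check a file like the example above with Python's standard-library `urllib.robotparser`. One caveat: this parser applies Allow/Disallow rules first-match rather than Google's longest-match, so results can differ when rules overlap.

```python
import urllib.robotparser

# The example robots.txt from this section, as a list of lines
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/admin/page"))   # blocked -> False
print(rp.can_fetch("*", "https://example.com/public/page"))  # allowed -> True
print(rp.site_maps())  # Python 3.8+: the listed Sitemap URLs
```

`RobotFileParser` can also fetch a live file with `set_url(...)` and `read()` instead of `parse(...)`.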
Common Crawler User-Agent Names
| Search Engine | User-Agent String | Notes |
|---|---|---|
| Google (General) | Googlebot | Main Google crawler |
| Google Images | Googlebot-Image | Indexes images for Google Images |
| Google News | Googlebot-News | Indexes news content |
| Google Video | Googlebot-Video | Indexes video content |
| Bing | bingbot | Microsoft Bing crawler |
| Yahoo | Slurp | Yahoo Search (powered by Bing) |
| DuckDuckGo | DuckDuckBot | DuckDuckGo search engine |
| Baidu | Baiduspider | Chinese search engine |
| Yandex | YandexBot | Russian search engine |
| Facebook | facebookexternalhit | Link preview generator |
| Twitter/X | Twitterbot | Twitter card preview |
| All Crawlers | * | Wildcard for all crawlers |
Common Robots.txt Examples
Example 1: Allow All Crawlers (Default)

    User-agent: *
    Disallow:
    Sitemap: https://example.com/sitemap.xml

Example 2: Block All Crawlers

    User-agent: *
    Disallow: /

Example 3: Block Specific Folders

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Disallow: /temp/
    Disallow: /cgi-bin/

Example 4: Block Specific Crawler

    User-agent: Googlebot
    Disallow: /admin/

    User-agent: *
    Disallow: /private/

Example 5: Allow Subfolder in Blocked Directory

    User-agent: *
    Disallow: /admin/
    Allow: /admin/public/

Example 6: Block Specific File Types

    User-agent: *
    Disallow: /*.pdf$
    Disallow: /*.doc$
    Disallow: /*.xls$

Example 7: Multiple Sitemaps

    User-agent: *
    Disallow:
    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-news.xml
    Sitemap: https://example.com/sitemap-images.xml
Path Pattern Matching
Path Matching Rules:

1. Prefix Match:

    Disallow: /admin

Blocks any path that starts with /admin: /admin, /admin/, /admin/page, and even /administrator. Robots.txt paths are prefix matches, not exact matches.

2. Directory Match (with trailing slash):

    Disallow: /admin/

Blocks /admin/, /admin/page, /admin/anything. Does NOT block /admin itself (no trailing slash).

3. Wildcard Match (*):

    Disallow: /*.pdf

Blocks any URL whose path contains .pdf.

    Disallow: /admin/*

Blocks everything under /admin/ (effectively the same as Disallow: /admin/).

4. End-of-URL Anchor ($):

    Disallow: /*.pdf$

Blocks URLs that END with .pdf.

    Disallow: /page.html$

Blocks only the exact URL /page.html.

5. Pattern Examples:
- /admin — matches any path beginning with /admin
- /admin/ — matches the directory and its contents
- /*.pdf$ — matches PDF files at the end of the URL
- /admin/*.html — matches HTML files under /admin/
- /*? — matches any URL containing a query string (? is a literal character in robots.txt, not a single-character wildcard)
- /page* — same as /page, since matching is prefix-based: matches /page, /page1, /pages, etc.

Important: Google and Bing support the * and $ wildcards, but other crawlers may not, so do not rely on wildcards for every user-agent.
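Google-style pattern matching can be sketched in a few lines of Python. This helper (an illustration, not an official parser) converts a robots.txt pattern into a regular expression that is anchored at the start of the URL path, with a trailing `$` anchoring the end:

```python
import re

def robots_pattern_to_regex(pattern):
    """Compile a robots.txt path pattern using Google-style * and $ semantics."""
    regex = re.escape(pattern).replace(r"\*", ".*")  # * matches any characters
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # a trailing $ anchors the end of the URL
    return re.compile(regex)  # re.match anchors at the start (prefix match)

print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))       # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf?dl=1")))  # False
print(bool(robots_pattern_to_regex("/admin/").match("/admin")))                  # False
```

This mirrors the rules above: `/admin/` does not match the bare `/admin`, and `/*.pdf$` only matches URLs that actually end in `.pdf`.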
Robots.txt Best Practices
| Practice | Recommendation |
|---|---|
| File location | Must be in root directory (https://domain.com/robots.txt) |
| File size | Google reads at most 500 KiB; content past that limit is ignored |
| Character encoding | Use UTF-8 encoding for international characters |
| Case sensitivity | Paths are case-sensitive (/Admin/ ≠ /admin/) |
| Comments | Use # for comments to document your rules |
| Testing | Always test in Google Search Console before deploying |
| Sitemap location | Include Sitemap directive to help crawlers find content |
| Crawl-delay | Use sparingly; Google ignores this directive |
Common Mistakes to Avoid
- Blocking CSS/JS files: Blocking resources Google needs to render pages can hurt rankings. Allow /css/ and /js/ directories.
- Using robots.txt for security: Robots.txt is not a security measure. Anyone can access blocked URLs. Use authentication for sensitive content.
- Blocking already-indexed pages: If a page is already indexed, blocking it with robots.txt won't remove it from search results. Use noindex or removal tools.
- Incorrect path syntax: Paths must start with / (forward slash). Disallow: admin is invalid; use Disallow: /admin/.
- Forgetting the sitemap: Always include your sitemap URL to help crawlers discover all your content efficiently.
- Over-blocking content: Be careful with wildcards. Disallow: /*.php could block important pages unintentionally.
- Not testing changes: Always test robots.txt changes in Google Search Console before deploying to production.
Testing and Validation
Testing Tools:

1. Google Search Console — robots.txt report
- Access via Search Console → Settings → robots.txt
- Shows how Googlebot fetched and parsed your file
- Shows which URLs are blocked
- Validates syntax and warns of issues

2. Bing Webmaster Tools
- Similar testing functionality for Bing
- Shows how bingbot interprets your rules

3. Manual Testing:
- Visit https://yourdomain.com/robots.txt
- Verify the file is accessible (HTTP 200)
- Check for typos and syntax errors
- Ensure paths match your URL structure

4. URL Inspection:
- Use the Search Console URL Inspection tool
- Check whether specific URLs are blocked
- View crawling status

Common Error Messages:
- "Unable to fetch" — file not accessible
- "Syntax error" — invalid directive format
- "Blocked by robots.txt" — URL blocked from crawling
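Alongside the hosted tools above, a quick offline syntax check is easy to script. This small Python linter (illustrative only, covering the directives discussed on this page) flags lines without a colon, unknown directives, and Allow/Disallow paths that do not start with a forward slash:

```python
VALID_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return a list of (line_number, message) warnings for a robots.txt body."""
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append((n, "missing ':' separator"))
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        if directive.lower() not in VALID_DIRECTIVES:
            problems.append((n, f"unknown directive '{directive}'"))
        elif directive.lower() in {"allow", "disallow"} and value and not value.startswith("/"):
            problems.append((n, f"path should start with '/': '{value}'"))
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/\nAllow: public/"))
```

On that sample input it reports two warnings: the misspelled `Disalow` directive and the `public/` path missing its leading slash.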
SEO Impact of Robots.txt
- Crawl budget optimization: Blocking low-value pages (admin, search results, filters) helps crawlers focus on important content.
- Duplicate content prevention: Block printer-friendly versions, session IDs, and filtered product listings to avoid duplicate content issues.
- Server load reduction: Blocking aggressive crawlers or specific paths can reduce server load during high-traffic periods.
- Staging site protection: Block search engines from development/staging sites to prevent premature indexing.
- Not a ranking factor: Proper robots.txt usage doesn't directly improve rankings, but it helps crawlers efficiently discover and index your content.
Frequently Asked Questions
- What is a robots.txt file and how does it work?
- A robots.txt file is a plain text file in your website's root directory that tells search engine crawlers which URLs they can or cannot access. It uses the Robots Exclusion Protocol with directives like User-agent, Allow, Disallow, and Sitemap. Crawlers read this file before accessing your site and respect the rules specified for their user-agent.
- What are the common robots.txt directives?
- Common directives include: User-agent (specifies which crawler the rules apply to), Disallow (blocks access to specific paths), Allow (permits access to specific paths within a blocked directory), Sitemap (provides the XML sitemap location), and Crawl-delay (requests a delay between requests; ignored by Google). The non-standard Host directive, once used by Yandex to declare a preferred domain, is now deprecated. Each directive appears on its own line.
- How do I block all crawlers from my site?
- To block all crawlers, use: User-agent: * followed by Disallow: /. This tells all crawlers they cannot access any part of your site. However, note that robots.txt is not a security mechanism - determined crawlers can ignore it, and blocked pages may still appear in search results if linked from other sites.
- How do I allow all crawlers to access my site?
- To allow all crawlers full access, use: User-agent: * followed by Disallow: (empty). An empty Disallow directive means nothing is blocked. Alternatively, you can omit the Disallow line entirely. You can also add Sitemap: https://example.com/sitemap.xml to help crawlers find your content.
- Does robots.txt prevent pages from appearing in Google?
- No, robots.txt only prevents crawling, not indexing. If a blocked page is linked from other websites, Google may still index its URL without content. To prevent indexing, use noindex meta tags or password protection. For sensitive data, use server-side authentication. Robots.txt is a request, not a security measure.
- What is the difference between Allow and Disallow?
- Disallow blocks crawlers from accessing specified paths. Allow permits access to specific paths that would otherwise be blocked by a broader Disallow rule. For example, Disallow: /admin/ blocks the entire admin folder, but Allow: /admin/public/ permits access to the public subfolder within admin.
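To make that precedence concrete, here is a minimal Python sketch of Google's documented resolution order for plain prefix rules (no wildcards): the longest matching pattern wins, and on a length tie the less restrictive Allow rule wins. It is a teaching aid, not a production parser.

```python
def is_allowed(rules, path):
    """rules: list of ('allow' | 'disallow', path_prefix) pairs."""
    matches = [(len(prefix), directive)
               for directive, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    longest = max(length for length, _ in matches)
    # The most specific (longest) rule wins; Allow wins a length tie
    return any(d == "allow" for length, d in matches if length == longest)

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed(rules, "/admin/public/page"))  # True: longer Allow rule wins
print(is_allowed(rules, "/admin/secret"))       # False: only Disallow matches
```

This is why `Allow: /admin/public/` can carve an exception out of `Disallow: /admin/`: the Allow pattern is longer, so it takes precedence for URLs under /admin/public/.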