Robots.txt Tester
Test whether a URL path is allowed or blocked for a specific user-agent based on robots.txt rules.
About Robots.txt Tester
This robots.txt tester simulates how search engine crawlers interpret robots.txt rules. It parses your robots.txt content, finds rules for a specific user-agent, and determines whether a given URL path would be allowed or blocked based on the most specific matching rule.
It is useful for troubleshooting why pages aren't being indexed, verifying new robots.txt rules before deployment, testing edge cases with Allow/Disallow conflicts, debugging crawler access issues, and ensuring important pages aren't accidentally blocked from search engines.
How Path Matching Works
The tester uses the same matching logic as search engine crawlers:
Matching Algorithm:
1. Parse robots.txt into rule groups by user-agent.
2. Find the rule group for the specified user-agent.
3. If no specific group exists, fall back to the "*" (wildcard) group.
4. Check all Allow and Disallow rules against the path.
5. Find the LONGEST matching path (the most specific rule).
6. Apply that rule: Allow = accessible, Disallow = blocked.
7. If no rule matches, the URL is ALLOWED by default.

Example:
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/

Test Path: /admin/public/page.html
- Matches: Disallow: /admin/ (length 7)
- Matches: Allow: /admin/public/ (length 14)
- Winner: Allow (longer path)
- Result: ALLOWED

Test Path: /admin/settings
- Matches: Disallow: /admin/ (length 7)
- No Allow rule matches
- Result: BLOCKED

Test Path: /public/page
- No matching rules
- Result: ALLOWED (default)
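The steps above can be sketched in Python. This is a simplified prefix-matching model of the tester's logic, not a full RFC 9309 parser; the function names are illustrative.

```python
def parse_robots(text):
    """Parse robots.txt into {agent_lowercase: [(directive, path), ...]} groups."""
    groups = {}
    current = []           # agents of the group currently being filled
    seen_rule = False      # has a rule appeared since the last User-agent line?
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent" and value:
            if seen_rule:                     # a User-agent line after rules starts a new group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            seen_rule = True
            if value:                         # an empty Disallow means "allow all"
                for agent in current:
                    groups[agent].append((field, value))
    return groups

def is_allowed(groups, user_agent, path):
    """Longest-match decision: the most specific matching rule wins; ties favor Allow."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = None
    for directive, rule_path in rules:
        if path.startswith(rule_path):
            if (best is None or len(rule_path) > len(best[1])
                    or (len(rule_path) == len(best[1]) and directive == "allow")):
                best = (directive, rule_path)
    return True if best is None else best[0] == "allow"

robots = """User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/"""

groups = parse_robots(robots)
print(is_allowed(groups, "Googlebot", "/admin/public/page.html"))  # True: Allow is longer
print(is_allowed(groups, "Googlebot", "/admin/settings"))          # False: Disallow matches
print(is_allowed(groups, "Googlebot", "/public/page"))             # True: no rule matches
```

Note that the decision depends only on rule length, not on the order rules appear in the file.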
Test Result Examples
Example 1: Simple Block
robots.txt:
User-agent: *
Disallow: /private/
Test: User-agent = Googlebot, Path = /private/secret
Result: BLOCKED
Matched rule: Disallow: /private/

Example 2: Allow Override
robots.txt:
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Test: User-agent = Googlebot, Path = /admin/public/docs
Result: ALLOWED
Matched rule: Allow: /admin/public/

Example 3: Specific User-Agent
robots.txt:
User-agent: Googlebot
Disallow: /no-google/
User-agent: Bingbot
Disallow: /no-bing/
Test: User-agent = Googlebot, Path = /no-google/page
Result: BLOCKED
Matched rule: Disallow: /no-google/
Test: User-agent = Bingbot, Path = /no-google/page
Result: ALLOWED (no matching rule for Bingbot)

Example 4: No Match = Allowed
robots.txt:
User-agent: *
Disallow: /admin/
Test: User-agent = Googlebot, Path = /public/page
Result: ALLOWED
Reason: No matching rule found
Common User-Agent Strings
| Search Engine | User-Agent Value | Test With |
|---|---|---|
| Google (Main) | Googlebot | Googlebot |
| Google Images | Googlebot-Image | Googlebot-Image |
| Google News | Googlebot-News | Googlebot-News |
| Bing | bingbot | bingbot |
| Yahoo | Slurp | Slurp |
| DuckDuckGo | DuckDuckBot | DuckDuckBot |
| Baidu | Baiduspider | Baiduspider |
| Yandex | YandexBot | YandexBot |
| All Crawlers | * | * or any agent |
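For quick checks against these user-agents, Python's standard library ships `urllib.robotparser`. One caveat: it evaluates rules in file order (first match wins, per the original 1994 convention) rather than Google's longest-match rule, so results can differ when Allow and Disallow rules overlap. The robots.txt content below is illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with crawler-specific groups (Example 3 above).
robots_txt = """\
User-agent: Googlebot
Disallow: /no-google/

User-agent: bingbot
Disallow: /no-bing/
"""

rfp = RobotFileParser()
rfp.parse(robots_txt.splitlines())

print(rfp.can_fetch("Googlebot", "/no-google/page"))  # False: blocked for Googlebot
print(rfp.can_fetch("bingbot", "/no-google/page"))    # True: no matching rule for bingbot
```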
Path Matching Scenarios
Scenario 1: Exact Directory Match
Rule: Disallow: /admin/
/admin/ → BLOCKED
/admin/page → BLOCKED
/admin → ALLOWED (no trailing slash)
/administrator → ALLOWED (different path)
Scenario 2: File Extension Block
Rule: Disallow: /*.pdf$
/file.pdf → BLOCKED
/file.pdf?id=1 → ALLOWED ($ anchors the end of the URL, and the query string follows .pdf)
/file.html → ALLOWED
/pdf/file.html → ALLOWED
Scenario 3: Nested Allow/Disallow
Rules:
Disallow: /files/
Allow: /files/public/
/files/ → BLOCKED
/files/private/ → BLOCKED
/files/public/ → ALLOWED
/files/public/doc → ALLOWED
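Scenario 3's resolution can be checked in a few lines of Python. The rule list and helper below are illustrative; tie-breaking between equal-length rules is omitted since this scenario has none.

```python
# Illustrative rule list for Scenario 3: longest matching prefix wins.
RULES = [("disallow", "/files/"), ("allow", "/files/public/")]

def check(path):
    matches = [(len(p), d) for d, p in RULES if path.startswith(p)]
    if not matches:
        return "ALLOWED"            # no rule matches: allowed by default
    _, directive = max(matches)     # the longest prefix is the most specific
    return "ALLOWED" if directive == "allow" else "BLOCKED"

for p in ["/files/", "/files/private/", "/files/public/", "/files/public/doc"]:
    print(p, "→", check(p))
```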
Scenario 4: Multiple Rules Same Path
Rules:
Allow: /page
Disallow: /page
/page → ALLOWED (on a length tie, crawlers apply the less restrictive rule, so Allow wins)
Scenario 5: Wildcard Patterns (not supported by all crawlers)
Rule: Disallow: /*?
/page → ALLOWED (no query string)
/page?id=1 → BLOCKED (has query parameter)
/page?a=1&b=2 → BLOCKED
Note: Wildcards (* and $) are supported by major crawlers such as Google and Bing and are standardized in RFC 9309, but older or niche crawlers may still ignore them.
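One way to test wildcard rules is to translate the pattern into a regular expression, mapping * to "any sequence" and a trailing $ to an end-of-URL anchor. The sketch below is an approximation for experimentation, not Google's actual implementation.

```python
import re

def wildcard_to_regex(rule_path):
    """Translate a robots.txt pattern with * and $ into a compiled regex.

    * matches any character sequence; a trailing $ anchors the end of the URL.
    """
    anchored = rule_path.endswith("$")
    core = rule_path[:-1] if anchored else rule_path
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(pattern + ("$" if anchored else ""))

pdf_rule = wildcard_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/file.pdf")))       # True: URL ends in .pdf
print(bool(pdf_rule.match("/file.pdf?id=1")))  # False: $ anchors the URL end
query_rule = wildcard_to_regex("/*?")
print(bool(query_rule.match("/page?id=1")))    # True: URL contains a query string
print(bool(query_rule.match("/page")))         # False: no "?" in the URL
```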
Troubleshooting Guide
| Problem | Possible Cause | Solution |
|---|---|---|
| Page not indexed | Accidentally blocked by robots.txt | Test URL path; remove or adjust Disallow rule |
| Wrong crawler blocked | Using * instead of specific user-agent | Create crawler-specific rule groups |
| Allow not working | Allow path shorter than Disallow | Make Allow path more specific (longer) |
| Subdirectories blocked | Missing trailing slash on Disallow | Add trailing slash: /admin/ not /admin |
| Query strings blocked | Wildcard rule affecting URLs with ? | Review /* patterns; test with query strings |
Robots.txt Syntax Reference
Valid Directives:
- User-agent: [crawler-name] - specifies which crawler the group applies to
- Disallow: [path] - blocks access to the path
- Allow: [path] - permits access to the path
- Sitemap: [url] - sitemap location
- Crawl-delay: [seconds] - request delay (ignored by Google)
- Host: [domain] - preferred domain (Yandex only)

Comments:
# This is a comment
# Comments are ignored by crawlers
# Use comments to document your rules

Syntax Rules:
- One directive per line, in the form Directive: value
- No quotes around values
- Directive names are case-insensitive; paths are case-sensitive
- Paths must start with /
- An empty Disallow value means "allow all"

Invalid Syntax (ignored by most parsers):
Disallow: admin       # relative path: must start with /
"Disallow: /admin"    # quoted value: quotes are not stripped
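A minimal line checker can flag the invalid patterns above before deployment. The directive set and messages below are illustrative, and this sketch is far from a complete validator.

```python
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay", "host"}

def lint_line(raw):
    """Return None if the line looks valid, else a short problem description."""
    line = raw.split("#", 1)[0].strip()    # comments and blank lines are fine
    if not line:
        return None
    field, sep, value = line.partition(":")
    field, value = field.strip().lower(), value.strip()
    if not sep or field not in KNOWN:
        return "unknown or malformed directive"
    if (field in ("disallow", "allow") and value
            and not value.startswith(("/", "*"))):
        return "path should start with / (relative paths are ignored)"
    return None

for raw in ["Disallow: /private/", "Disallow: admin", '"Disallow: /admin"', "# note"]:
    print(repr(raw), "→", lint_line(raw) or "OK")
```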
Testing Best Practices
- Test critical URLs: Always test your homepage, important landing pages, and product pages to ensure they're not blocked.
- Test multiple crawlers: If you have crawler-specific rules, test with each relevant user-agent (Googlebot, bingbot, etc.).
- Test edge cases: Check URLs at directory boundaries, with query strings, and with file extensions.
- Verify after changes: Re-test all critical URLs after modifying your robots.txt file.
- Use Google Search Console: For production sites, verify your robots.txt in Search Console's robots.txt Tester tool.
- Document your rules: Add comments explaining why certain paths are blocked for future reference.
Limitations of This Tester
- Wildcard support: This tester uses simple prefix matching. Google's advanced wildcards (* and $) are not fully simulated.
- Crawl-delay: This directive is not tested; Google ignores it anyway.
- Multiple rule groups: Complex multi-group robots.txt files may have edge cases not covered.
- Real-time fetching: This tester doesn't fetch your actual robots.txt; you must paste the content.
- Crawler behavior: Different crawlers may interpret rules slightly differently. Always verify with official tools.
Frequently Asked Questions
- How does the robots.txt path matching algorithm work?
- Path matching finds the most specific rule that applies to a URL. Rules are matched by checking if the URL path starts with the Disallow or Allow path value. When multiple rules match, the longest (most specific) path wins. For example, if Disallow: /admin/ and Allow: /admin/public/ both match, the longer Allow rule takes precedence for /admin/public/page.
- What happens if no rule matches my URL?
- If no Disallow or Allow rule matches the tested URL path, the URL is considered allowed by default. Search crawlers can access and index URLs that aren't explicitly blocked. This is why an empty robots.txt file (or one with only 'User-agent: *' and no Disallow directives) allows full site access.
- How do I test rules for a specific crawler?
- Enter the crawler's user-agent name (e.g., Googlebot, bingbot, Slurp) in the User-agent field. The tester first looks for rules specific to that user-agent. If no specific rules exist, it falls back to rules for '*' (wildcard), which applies to all crawlers. This matches how real crawlers interpret robots.txt.
- Do wildcards (*) work in robots.txt testing?
- Google supports * (matches any sequence) and $ (end-of-URL anchor) wildcards in robots.txt. However, not all crawlers support wildcards. This tester uses simple prefix matching for compatibility. For wildcard testing, verify with Google Search Console's robots.txt tester for Googlebot-specific behavior.
- What is the order of rule evaluation?
- Crawlers evaluate rules in order: 1) Find the user-agent group (specific agent or wildcard), 2) Collect all Allow and Disallow rules for that group, 3) Find rules that match the URL path, 4) Select the longest matching path, 5) Apply that rule (Allow or Disallow). If no rules match, access is allowed.
- Why is my URL blocked even though I have an Allow rule?
- The Allow rule might be less specific (shorter) than a matching Disallow rule. For example, with 'Allow: /page' and 'Disallow: /page/', the URL '/page/sub' matches both, but '/page/' is longer, so Disallow wins. To override a Disallow, the Allow path must be at least as long as (more specific than) the Disallow path; rule order does not matter under longest-match evaluation.