Robots.txt Tester
Fetch a live robots.txt, run several paths for any crawler token, and see which group got picked and which rule won.
This robots.txt tester fetches the live file at a domain root server-side, then turns the wall of directives into a plain verdict. Type a host and a crawler token, paste the paths you care about, and it pulls out the user-agent groups, the allow and disallow rules, and the sitemap lines, then runs each path and tells you which group got selected and which rule actually won. It copies Google-style group selection and path precedence: the most specific group is used, the longest matching rule wins, and allow breaks a tie. It also honors the * wildcard and the $ end marker. It flags a wildcard Disallow on a public site, checks the declared sitemaps, and keeps the raw file on screen so you can compare it against what a plugin thinks it set. Remember that robots.txt controls crawling, not indexing.
Queries run through the PeopleAreGeek lookup service. We log nothing.
Live robots.txt fetch and path rule simulation
Reading a robots.txt by eye is where mistakes happen. This fetches the live file server-side, pulls out the sitemap lines and the crawler groups, runs a few paths through for whatever token you pick, and just tells you which group got picked and which allow or disallow rule actually won. No more squinting at a wall of text and hoping you read it right.
The path simulator copies Google-style group selection and path precedence for the common rules. It gets you close. For anything you really care about, still confirm with your actual crawler tools and how the live host behaves.
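To make the parsing step concrete, here is a minimal Python sketch that splits a robots.txt body into crawler groups and sitemap lines. It is an illustration under assumptions, not this tool's actual code: the names parse_robots, is_absolute and SAMPLE are made up for the example, and fields other than User-agent, Allow, Disallow and Sitemap are simply skipped.

```python
from urllib.parse import urlsplit

def parse_robots(text):
    """Split robots.txt text into user-agent groups and sitemap URLs."""
    groups = []       # each group: {"agents": [...], "rules": [(kind, pattern), ...]}
    sitemaps = []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not last_was_agent:               # first agent line opens a new group
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value.lower())
            last_was_agent = True
        elif field in ("allow", "disallow"):
            if current is not None:              # rules before any group are ignored
                current["rules"].append((field, value))
            last_was_agent = False
        elif field == "sitemap":
            sitemaps.append(value)               # sitemap lines belong to no group
    return groups, sitemaps

def is_absolute(url):
    """True when a sitemap value is a full http(s) URL rather than a bare path."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Hypothetical file used only for this example.
SAMPLE = """\
# sample file
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: Googlebot
User-agent: Bingbot
Disallow: /search

Sitemap: https://example.com/sitemap.xml
Sitemap: /old-sitemap.xml
"""

groups, sitemaps = parse_robots(SAMPLE)
for url in sitemaps:
    print(url, "absolute" if is_absolute(url) else "NOT absolute")
```

Consecutive User-agent lines share one group here, which is how the Googlebot and Bingbot lines in the sample end up with the same rule.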
What a robots.txt tester should make clear
A robots.txt tester should turn a wall of directives into a plain verdict. Robots.txt looks simple. That's the trap. You skim it, you think you get it, and then you've misread a nested allow as a blanket block. A real file mixes a general User-agent: * group with a more specific crawler group, throws an allow exception inside a broader disallow, scatters a sitemap line or two around, and buries the part that actually matters under a comment. So a tester worth using has to show you both things at once: the raw file you can read line by line, and the plain verdict for the one path you came here to check.
Here's what this does. It hits the live file at the domain root through the backend, parses out the crawler groups, runs several paths in one shot, then says which rule won for the token you gave it. Honestly the moment it earns its keep is after you've touched something: a WordPress update, a theme or plugin that quietly rewrites the virtual robots output, a sitemap plugin swap, a migration. Or a Search Console report flagging some URL as blocked and you have no idea why.
Robots.txt controls crawling, not indexing
A robots rule tells a well-behaved crawler whether it's allowed to request a path. That's it. It isn't a privacy wall, and it sure isn't a clean delete button for a page. Block a URL from crawling and you may have just stopped the engine from ever reading the canonical or robots meta tag sitting on that page. So for actual index cleanup, match the signal to the job: redirects when content moved, a noindex tag on stuff you'll still let them fetch but don't want indexed, the right status code when something's gone. Robots rules are for crawl access, full stop.
How crawler and path matching are read here
The simulator is built around the one decision a technical SEO actually needs answered. It grabs the most specific crawler group matching the token you typed, merges in any other groups that match just as specifically, then works through the allow and disallow rules. The longest matching rule wins. When an allow and a disallow of the same length both match, allow takes it. And yeah, it handles the * wildcard and the $ end marker that modern engines use when they parse robots files.
- Type the exact host you want to audit. Robots rules are bound to the scheme, host, and port of the file that got fetched, nothing else.
- Throw a real public path at it, plus an admin or search path, plus whatever URL got reported as blocked.
- Read the winning rule. Not the rule count, the actual winner.
- Glance at the declared sitemap URLs and make sure they still parse.
- Keep the raw output on screen when you're comparing the live file against what a plugin thinks it set.
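The precedence described above (longest match wins, allow breaks a tie, * and $ are honored) can be sketched in a few lines of Python. This is a simplified model under assumptions, not the simulator's real code: pattern_to_regex and decide are hypothetical names, rule length is counted in characters of the pattern, and paths are assumed to be already normalized.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex anchored at the path start.
    '*' matches any run of characters; a trailing '$' pins the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def decide(rules, path):
    """rules: list of (kind, pattern), kind being 'allow' or 'disallow'.
    Returns (verdict, winning_rule); winning_rule is None when nothing matches."""
    best = None
    for kind, pattern in rules:
        if not pattern:                          # an empty pattern matches nothing
            continue
        if not pattern_to_regex(pattern).match(path):
            continue
        key = (len(pattern), kind == "allow")    # longer wins; allow wins a tie
        if best is None or key > best[0]:
            best = (key, (kind, pattern))
    if best is None:
        return "allow", None                     # no matching rule: crawl allowed
    return best[1][0], best[1]

# Hypothetical rule set in the shape a WordPress site often serves.
rules = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/*.pdf$"),
]
print(decide(rules, "/wp-admin/options.php"))     # ('disallow', ('disallow', '/wp-admin/'))
print(decide(rules, "/wp-admin/admin-ajax.php"))  # ('allow', ('allow', '/wp-admin/admin-ajax.php'))
print(decide(rules, "/files/a.pdf?x=1"))          # ('allow', None)
```

The last call shows why the $ marker matters: with the query string attached, the path no longer ends in .pdf, so the rule does not match.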
WordPress robots checks worth doing
On a public WordPress site you'll almost always see /wp-admin/ blocked with an allow carved out for /wp-admin/admin-ajax.php. Fine. That's normal, and it proves nothing about the rest of your crawl setup. Test your money pages too, the articles and tools that actually matter. Then the search or parameter patterns you meant to limit. Then sitemap discovery. And don't forget whatever your security plugin or host quietly injected when you weren't looking.
Good technical SEO habits around robots.txt
- Re-fetch the live file any time you've touched an SEO plugin, a sitemap, a cache, or done a migration.
- When the rules look split, test the exact same URL twice: once as Googlebot, once as the generic star group.
- Don't block the CSS and JS crawlers need to render a public page. Not without a really good reason, anyway.
- Keep sitemap lines absolute and current. Stale ones are easy to forget.
- Pair the robots check with an indexability and canonical look on the pages you actually want ranking.
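Testing the same URL as Googlebot and as the star group only makes sense once you see how a group gets picked. Here is a minimal Python sketch of that step. It assumes exact, case-insensitive token matching with a fallback to the * group, which is a simplification: real crawlers may resolve related tokens differently. select_rules and the sample groups are hypothetical.

```python
def select_rules(groups, token):
    """groups: list of {"agents": [...], "rules": [...]} dicts.
    Returns (matched_agent, rules). Groups naming the token are merged;
    the '*' group is used only when no group names the token at all."""
    token = token.lower()
    specific, fallback = [], []
    found_specific = found_fallback = False
    for group in groups:
        agents = {agent.lower() for agent in group["agents"]}
        if token in agents:
            found_specific = True
            specific.extend(group["rules"])
        if "*" in agents:
            found_fallback = True
            fallback.extend(group["rules"])
    if found_specific:
        return token, specific       # star-group rules are NOT inherited
    if found_fallback:
        return "*", fallback
    return None, []                  # no group applies: nothing is restricted

# Hypothetical parsed groups, with Googlebot rules split across two groups.
groups = [
    {"agents": ["*"], "rules": [("disallow", "/private/")]},
    {"agents": ["Googlebot"], "rules": [("disallow", "/search")]},
    {"agents": ["googlebot"], "rules": [("allow", "/search/help")]},
]
print(select_rules(groups, "Googlebot"))
print(select_rules(groups, "Bingbot"))
```

Note what the first call leaves out: the /private/ rule lives only in the star group, so under this model it does not apply to Googlebot at all. That is the kind of split the double test is meant to catch.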
Frequently asked questions
Does an allowed robots result guarantee indexing?
Nope. All it does is clear one doubt: the crawler is permitted to request the URL. The URL still has to earn it. Useful content, a healthy status code, a canonical that makes sense, some internal links pointing at it, and no stray noindex undoing the whole effort.
Is a blocked path always a mistake?
Not at all. Admin pages, carts, search results, duplicate or private workflow URLs often get blocked on purpose. The real question is simpler: does blocking that path line up with what you want the site to do?
Why test several paths at once?
Because robots files love broad rules with one narrow exception buried inside. Put a public page, a blocked area, and the exception path next to each other and the pattern just clicks. Much harder to fool yourself that way.
Does robots.txt stop a page from being indexed?
No, and this one trips people up constantly. It blocks crawling, not indexing. A disallowed URL can still land in the index without a snippet if other pages link to it. Want it gone? Allow the crawl, then serve a noindex.
What is the difference between Disallow and noindex?
Disallow stops the crawl. Noindex (a robots meta tag or an X-Robots-Tag header) tells the engine not to index. Here is the catch most people miss: block a page with disallow and the crawler never reaches the noindex you put on it, so the page you wanted gone can stay in the index.
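The interaction is easy to model. The Python sketch below looks for a noindex in a response's headers and HTML, then reports whether a crawler could ever see it given the robots verdict. Everything here is illustrative: noindex_signals, cleanup_status and RobotsMetaFinder are hypothetical names, and only the generic robots meta name is checked, not crawler-specific ones.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())

def noindex_signals(headers, html):
    """Return the places where a noindex directive was found."""
    found = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            found.append("X-Robots-Tag header")
    finder = RobotsMetaFinder()
    finder.feed(html)
    if any("noindex" in directive for directive in finder.directives):
        found.append("robots meta tag")
    return found

def cleanup_status(crawl_allowed, headers, html):
    """Combine the robots verdict with the noindex check."""
    signals = noindex_signals(headers, html)
    if not signals:
        return "no noindex found"
    if not crawl_allowed:
        return "noindex present but never fetched: path is disallowed"
    return "noindex reachable via " + ", ".join(signals)

PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(cleanup_status(False, {}, PAGE))
print(cleanup_status(True, {}, PAGE))
```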
Where must robots.txt live?
Right at the host root, exactly /robots.txt. Stick it in a subfolder and it is ignored, plain and simple. Oh, and every subdomain needs its own.
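If you want to work out which robots.txt governs a given page, a small Python sketch using the standard library does it. The function name robots_url and the example URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt location that applies to page_url.
    Scheme, host, and port are kept; path, query, and fragment are dropped."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/a/b?x=1#top"))  # https://blog.example.com/robots.txt
print(robots_url("http://example.com:8080/shop/"))         # http://example.com:8080/robots.txt
```

The two results differ in host, scheme and port, so they are two separate files with two separate rule sets.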