A robots.txt file is one of the simplest files on a web server, and one of the most dangerous to get wrong. A single line in the wrong place can block Google from crawling an entire site. This is not hypothetical: it has happened to major sites and cost them significant organic traffic. The DevToolShack Robots.txt Generator helps you build one correctly, but understanding the rules is what keeps you safe.
What Is robots.txt?
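Robots.txt is a plain text file located at the root of your domain (https://example.com/robots.txt) that tells web crawlers which URLs they should and shouldn't request. It's part of the Robots Exclusion Protocol, long an informal convention and standardized as RFC 9309 in 2022, which all major crawlers (Googlebot, Bingbot, and hundreds of others) respect. Compliance is voluntary, so robots.txt is a set of instructions for well-behaved crawlers, not an access control.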
The Basic Syntax
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://example.com/sitemap.xml
The key directives:
- User-agent: which crawler(s) this block applies to. * means all crawlers. Googlebot targets only Google's crawler.
- Disallow: paths the crawler should not visit. An empty value means "allow everything".
- Allow: explicitly permits a path, even if a parent is disallowed (useful for exceptions).
- Sitemap: tells crawlers where your XML sitemap lives. Can appear outside any user-agent block.
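As a quick sanity check of the syntax above, the example file can be fed to Python's standard-library urllib.robotparser. This is a minimal sketch: the crawler name ExampleBot and the two test URLs are made up for illustration. This parser applies rules in file order rather than by longest match, so treat it as a rough check and not as a model of how Googlebot evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the * block.
admin_ok = parser.can_fetch("ExampleBot", "https://example.com/admin/users")
blog_ok = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
sitemaps = parser.site_maps()  # available in Python 3.8+

print(admin_ok, blog_ok, sitemaps)
```

Here the admin URL is refused, the blog URL is permitted, and the Sitemap line is reported separately from the user-agent rules.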
The Most Dangerous Mistake
This is the line that has deindexed entire sites:
User-agent: *
Disallow: /
A single slash disallows everything. It usually arrives when a staging environment's robots.txt is copied to production by accident. Google Search Console will warn you, but by then crawling may already have stopped.
Always double-check your production robots.txt at yourdomain.com/robots.txt after deployment.
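That check can be automated as part of a deployment script. The sketch below is one possible heuristic, not a full parser: the function name blocks_everything is made up here, and it only flags a bare "Disallow: /" inside a group that applies to all crawlers. It does not consider whether Allow rules in the same group soften the block.

```python
def blocks_everything(robots_txt: str) -> bool:
    """Return True if a group for User-agent: * contains a bare 'Disallow: /'."""
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"
print(blocks_everything(staging), blocks_everything(production))
```

A deployment pipeline could fetch the live file, pass its text to this function, and fail the release if it returns True.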
What to Block (and What Not To)
Good candidates for Disallow
- /admin/: admin panels and backend routes
- /api/: API endpoints that don't need indexing
- /checkout/: transactional pages with no SEO value
- /search?: internal search results pages (thin content)
- /login, /register: authentication pages
- /wp-admin/, /wp-login.php: WordPress admin (if applicable)
- /staging/: any staging or test directories
Never block these
- Your CSS and JavaScript files — Googlebot needs these to render pages correctly
- Your main content pages — obvious, but accidental path matches happen
- Your images — if you want image search traffic
- Your sitemap — it should always be accessible
Targeting Specific Crawlers
You can have multiple user-agent blocks targeting different crawlers:
# Allow Googlebot everywhere
User-agent: Googlebot
Allow: /
# Block aggressive scrapers
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
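When multiple blocks match a crawler, the most specific user-agent takes precedence over *. The crawler then follows only that block. Rules from different blocks are not merged, so in the example above Googlebot ignores the Disallow lines listed under * and may crawl /admin/ and /api/.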
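The group selection can be observed with Python's standard-library urllib.robotparser, which also picks a named group over the * group. This is a sketch under assumptions: OtherBot is a hypothetical crawler name standing in for "everything else", and the paths are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow Googlebot everywhere
User-agent: Googlebot
Allow: /
# Block aggressive scrapers
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, so the * rules do not apply to it.
googlebot_admin = parser.can_fetch("Googlebot", "/admin/")
# AhrefsBot is blocked everywhere by its own group.
ahrefs_home = parser.can_fetch("AhrefsBot", "/")
# "OtherBot" (hypothetical) has no group of its own and falls back to *.
other_admin = parser.can_fetch("OtherBot", "/admin/")
other_blog = parser.can_fetch("OtherBot", "/blog/")

print(googlebot_admin, ahrefs_home, other_admin, other_blog)
```

If Googlebot should also stay out of /admin/ and /api/, those Disallow lines have to be repeated inside the Googlebot group.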
The Allow Directive for Exceptions
If you disallow a directory but want to allow specific files within it:
User-agent: *
Disallow: /private/
Allow: /private/public-document.pdf
When both Allow and Disallow match, the longer (more specific) path wins. /private/public-document.pdf is longer than /private/, so the Allow takes precedence.
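The longest-match rule can be sketched in a few lines. The function is_allowed below is illustrative only: it handles plain path prefixes without wildcards, and it assumes that an Allow and a Disallow of equal length resolve in favor of Allow, which is the behavior Google documents for its own crawler.

```python
def is_allowed(rules, path):
    """rules: list of (directive, prefix) pairs, e.g. ("disallow", "/private/")."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on a tie, prefer Allow (the less restrictive rule).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public-document.pdf"),
]
print(is_allowed(rules, "/private/public-document.pdf"))  # Allow is longer
print(is_allowed(rules, "/private/secret.pdf"))           # only Disallow matches
```

Not every parser works this way. Python's urllib.robotparser, for instance, applies the first matching rule in file order, so placing the Allow line before the Disallow line keeps both styles of parser in agreement.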
Wildcards
Robots.txt supports two wildcard characters:
- * in a path matches any sequence of characters
- $ at the end matches the end of the URL
# Block all URLs with query parameters
Disallow: /*?
# Block any URL ending in .pdf
Disallow: /*.pdf$
# Block any URL with /temp/ below the top level (a root-level /temp/ needs its own Disallow: /temp/)
Disallow: /*/temp/*
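One way to reason about these patterns is to translate them into regular expressions. The helper names pattern_to_regex and matches are made up for this sketch, and the translation assumes the two wildcard rules described above: * matches any run of characters, and a trailing $ anchors the end of the URL. Everything else is treated as a literal prefix.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each escaped * back into ".*".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, so an unanchored pattern is a prefix match.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*?", "/page?id=1"))             # has a query string
print(matches("/*.pdf$", "/docs/file.pdf"))     # ends in .pdf
print(matches("/*.pdf$", "/docs/file.pdf?v=2")) # does not end in .pdf
print(matches("/*/temp/*", "/a/temp/x"))        # /temp/ below the top level
print(matches("/*/temp/*", "/temp/x"))          # /temp/ at the root
```

Under this translation, /*/temp/* matches /a/temp/x but not /temp/x, because the pattern requires a slash and then /temp/ after it. The trailing * adds nothing, since every pattern already matches as a prefix.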
Verifying Your File
After publishing, verify your robots.txt in Google Search Console. The robots.txt report (under Settings) shows the file Google last fetched along with any parsing errors or warnings; it replaced the older robots.txt Tester, which has been retired. To check whether a specific URL is blocked by your rules, use the URL Inspection tool. Test rule changes before deploying them to production, and check yourdomain.com/robots.txt after every deployment.