How to Build a Robots.txt File That Won't Accidentally Block Google

A robots.txt file is one of the simplest files on a web server, and one of the most dangerous to get wrong. A single misplaced line can block Google from crawling an entire site. This is not hypothetical: it has happened to major sites and cost them significant organic traffic. The DevToolShack Robots.txt Generator helps you build one correctly, but understanding the rules is what keeps you safe.

What Is robots.txt?

Robots.txt is a plain text file located at the root of your domain (https://example.com/robots.txt) that tells web crawlers which URLs they may and may not request. It is part of the Robots Exclusion Protocol, a long-standing convention that was formalized as RFC 9309 in 2022 and that all major crawlers (Googlebot, Bingbot, and many others) respect.

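Robots.txt is a request, not a security control. Compliant crawlers honour it; malicious scrapers do not. Never use robots.txt to "protect" sensitive content; use authentication for that. The file is public, so listing a secret path in it only advertises that path. Robots.txt is for managing crawl budget and keeping crawlers away from low-value pages. It does not reliably keep a URL out of search results: a blocked URL can still be indexed, without its content, if other pages link to it. When a page must stay out of the index, leave it crawlable and give it a noindex directive instead.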

The Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml

The key directives:

  • User-agent — which crawler(s) this block applies to. * means all crawlers. Googlebot targets only Google's crawler.
  • Disallow — paths the crawler should not visit. An empty value means "allow everything".
  • Allow — explicitly permits a path, even if a parent is disallowed (useful for exceptions).
  • Sitemap — tells crawlers where your XML sitemap lives. Can appear outside any user-agent block.
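
As a rough illustration of how a crawler reads these directives, the sketch below feeds the example file to Python's standard-library urllib.robotparser and asks whether a few URLs may be fetched. The test URLs and the helper name allowed are made up for illustration. This parser applies rules in file order and does not interpret wildcards, so treat it as a quick sanity check rather than a model of how Googlebot evaluates a file.

```python
from urllib.robotparser import RobotFileParser

# The example file from this section, held in a string for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory; no network request


def allowed(url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, chosen to hit one allowed and two disallowed paths.
    for url in (
        "https://example.com/blog/post",
        "https://example.com/admin/users",
        "https://example.com/private/notes.txt",
    ):
        print(url, "->", "allowed" if allowed(url) else "blocked")
    print("sitemaps:", parser.site_maps())
```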

The Most Dangerous Mistake

This is the line that has deindexed entire sites:

User-agent: *
Disallow: /

A single slash disallows everything. It usually reaches production when a staging robots.txt is copied over by accident. Google Search Console will warn you, but by then crawling may already have stopped.

Always double-check your production robots.txt at yourdomain.com/robots.txt after deployment.
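
One way to catch this before it goes live is a small check in the deployment pipeline. The sketch below is a minimal example under stated assumptions: the function name has_blanket_disallow is hypothetical, and it only looks for a bare Disallow: / in a group that names the given user-agent. It is deliberately conservative, so it also flags a group that pairs Disallow: / with Allow exceptions.

```python
def has_blanket_disallow(robots_txt: str, agent: str = "*") -> bool:
    """Return True if a group naming `agent` contains a bare 'Disallow: /'."""
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rule lines starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in agents:
                return True
    return False


STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\n"

if __name__ == "__main__":
    print(has_blanket_disallow(STAGING))     # True
    print(has_blanket_disallow(PRODUCTION))  # False
```

A deployment script could read the file it is about to publish, call this function, and abort when it returns True.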

What to Block (and What Not To)

Good candidates for Disallow

  • /admin/ — admin panels and backend routes
  • /api/ — API endpoints that don't need indexing
  • /checkout/ — transactional pages with no SEO value
  • /search? — internal search results pages (thin content)
  • /login, /register — authentication pages
  • /wp-admin/, /wp-login.php — WordPress admin (if applicable)
  • /staging/ — any staging or test directories

Never block these

  • Your CSS and JavaScript files — Googlebot needs these to render pages correctly
  • Your main content pages — obvious, but accidental path matches happen
  • Your images — if you want image search traffic
  • Your sitemap — it should always be accessible

Targeting Specific Crawlers

You can have multiple user-agent blocks targeting different crawlers:

# Allow Googlebot everywhere
User-agent: Googlebot
Allow: /

# Block aggressive scrapers
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/

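When multiple blocks match a crawler, the most specific user-agent takes precedence over *. A crawler follows only that one block; the blocks are not merged. In the example above, Googlebot is therefore not bound by the /admin/ and /api/ rules in the * block. If you want those rules to apply to Googlebot too, repeat them inside the Googlebot block.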
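
To see group selection in action, the sketch below runs the file above through Python's urllib.robotparser and checks a few crawler and path combinations. The crawler name ExampleBot and the helper can_crawl are made up for illustration; ExampleBot stands in for any crawler that has no block of its own.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow Googlebot everywhere
User-agent: Googlebot
Allow: /

# Block aggressive scrapers
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "/admin/"),   # uses its own block, so /admin/ is open
        ("AhrefsBot", "/"),         # blocked everywhere
        ("ExampleBot", "/admin/"),  # falls back to the * block
        ("ExampleBot", "/blog/"),   # no rule matches, so crawling is allowed
    ]
    for agent, path in checks:
        print(f"{agent:10} {path:8} {can_crawl(agent, path)}")
```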

The Allow Directive for Exceptions

If you disallow a directory but want to allow specific files within it:

User-agent: *
Disallow: /private/
Allow: /private/public-document.pdf

When both Allow and Disallow match, the longer (more specific) path wins. /private/public-document.pdf is longer than /private/, so the Allow takes precedence.
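
The longest-match rule can be sketched in a few lines. This is a simplified illustration rather than any crawler's actual implementation: is_allowed is a hypothetical helper, it handles plain path prefixes only (no wildcards), and it assumes that Allow wins when an Allow and a Disallow of equal length both match, which is the least restrictive choice.

```python
def is_allowed(rules, path):
    """Apply (directive, prefix) rules to `path` using longest-match precedence.

    The longest matching prefix decides the outcome. On a tie in length,
    Allow wins. A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix restricts nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public-document.pdf"),
]

if __name__ == "__main__":
    print(is_allowed(RULES, "/private/public-document.pdf"))  # True
    print(is_allowed(RULES, "/private/secret.pdf"))           # False
    print(is_allowed(RULES, "/blog/"))                        # True
```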

Wildcards

Robots.txt supports two wildcard characters:

  • * in a path matches any sequence of characters
  • $ at the end matches the end of the URL

# Block all URLs with query parameters
Disallow: /*?

# Block any URL ending in .pdf
Disallow: /*.pdf$

# Block any URL with /temp/ below the top level
# (this pattern does not match a /temp/ directory at the site root)
Disallow: /*/temp/*
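
To check what a wildcard pattern actually matches, it can help to translate it into a regular expression. The sketch below is one possible translation, not a reference implementation: compile_rule and rule_matches are hypothetical helper names, and the sample paths are invented. Patterns match from the start of the path, * becomes "any sequence of characters", and a trailing $ anchors the end.

```python
import re


def compile_rule(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")  # a trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces; each * between them becomes ".*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, url_path: str) -> bool:
    """True if `pattern` matches `url_path` (path plus query) from its start."""
    return compile_rule(pattern).match(url_path) is not None


if __name__ == "__main__":
    samples = [
        ("/*?", "/products?page=2"),
        ("/*?", "/products"),
        ("/*.pdf$", "/docs/guide.pdf"),
        ("/*.pdf$", "/docs/guide.pdf?download=1"),
        ("/*/temp/*", "/cache/temp/a.txt"),
        ("/*/temp/*", "/temp/a.txt"),
    ]
    for pattern, path in samples:
        print(f"{pattern:12} {path:28} {rule_matches(pattern, path)}")
```

The last sample shows why /*/temp/* leaves a top-level /temp/ directory crawlable: the pattern needs at least one slash before /temp/.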

Verifying Your File

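After publishing, check the robots.txt report in Google Search Console (under Settings), which replaced the older robots.txt Tester. It lists the robots.txt files Google found for your site, when each was last fetched, and any warnings or errors. To check whether a specific URL is blocked, run it through the URL Inspection tool. Test rule changes before deploying them to production, not only afterwards.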

Generate and verify: The Robots.txt Generator builds a correctly formatted file from a simple form — add user agents, disallow paths, and add your sitemap URL. Combine it with the Meta Tag Generator for a complete on-page SEO setup. And always test at yourdomain.com/robots.txt after every deployment.