robots.txt Checklist: 12 Items Every Ecommerce Store Should Audit

Why Ecommerce Stores Need a robots.txt Audit

A robots.txt file tells crawlers which URLs they may request and which to skip. On an ecommerce store with thousands of product, category, and faceted-filter URLs, a single misconfigured directive can block revenue-generating pages from being crawled, or waste crawl budget on URLs that should never be indexed.

The 12-item checklist below gives each check a concrete pass/fail criterion. Run it quarterly, after any platform migration, and after adding a new faceted navigation or app that generates new URL patterns. Treat any 'fail' as a crawl-budget or indexation risk that requires immediate remediation.

The 12-Item robots.txt Audit Checklist

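1. File is reachable at the root domain. PASS: HTTP 200 returned at https://yourdomain.com/robots.txt. FAIL: 404, 500, or redirect to another URL. The failure modes differ: Google treats a 404 as 'no restrictions', which exposes unintended URLs; it treats a 5xx error as a temporary disallow of the whole site; and it follows only a limited number of redirect hops, so a redirect adds a dependency that can silently break.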

2. File size is under 500 KB. PASS: File is under 500 KB. FAIL: File exceeds 500 KB. Google parses only the first 500 KiB of the file; directives beyond that limit are silently ignored. Consolidate redundant rules to stay well under the limit.

3. Sitemap URL is declared. PASS: At least one 'Sitemap:' directive points to a valid XML sitemap URL. FAIL: No Sitemap: line exists. Declaring the sitemap in robots.txt ensures every crawler that reads the file can discover the full URL inventory without relying solely on Search Console submission.

4. No blanket Disallow for all bots. PASS: The wildcard user-agent (*) block does not contain 'Disallow: /'. FAIL: 'User-agent: *' followed by 'Disallow: /' is present without an accompanying Allow that overrides it for key paths. This pattern, often left over from dev or staging configs, blocks crawling of the entire site.

5. Cart and checkout paths are blocked. PASS: Directives disallow /cart, /checkout, and /order-confirmation (or platform equivalents). FAIL: These paths are crawlable. Cart and checkout pages provide zero SEO value, waste crawl budget, and can expose session-specific data in search results.

6. Account and login paths are blocked. PASS: /account, /login, /register, /wishlist, and similar authenticated-only paths are disallowed. FAIL: Any of these paths is crawlable. Indexed account pages add thin, near-duplicate URLs to the index and give searchers a poor experience when surfaced in results.

7. Faceted navigation parameters are managed. PASS: URL patterns that produce near-duplicate pages, such as filter combinations like ?color=&size=, are either disallowed via robots.txt path patterns or handled via canonicals, with crawl-budget protection confirmed in Google Search Console's Page indexing (formerly Coverage) report. FAIL: Filter URLs are freely crawlable with no duplicate-content mitigation in place.

8. Internal search result pages are blocked. PASS: /search, /search-results, or the platform's search query path (e.g., Shopify's /search?q=) is disallowed. FAIL: Search result pages are crawlable. These pages are dynamic, near-infinite in combination, and likely to be treated as duplicate or thin content.

9. Staging or dev subdomains are not governed by production robots.txt. PASS: Staging or dev environments have their own robots.txt with 'Disallow: /' and are also protected by HTTP authentication or IP restriction. FAIL: The production robots.txt references or is shared with dev/staging environments. Staging content indexed accidentally can create duplicate-content issues against live pages.

10. No critical product or category paths are accidentally disallowed. PASS: A crawl simulation (using Google Search Console's URL Inspection or a crawler tool) confirms all top-revenue category and product URL patterns return as 'Allowed'. FAIL: Any revenue-driving URL pattern matches an existing Disallow directive. This is the most financially damaging error type and requires immediate correction.

11. Googlebot-specific and other named bot directives are intentional. PASS: Every named user-agent block (Googlebot, Bingbot, AhrefsBot, etc.) has a documented reason and the rules are consistent with business intent. FAIL: Named bot blocks exist with no clear rationale, or rules contradict the wildcard block in unintended ways. A crawler follows only the most specific user-agent group that matches it and ignores the wildcard group entirely, so a named block that omits the wildcard rules silently lifts them for that bot.

12. robots.txt is version-controlled and change-logged. PASS: The file is stored in a version control system (Git or similar) with a commit history, or a changelog is maintained in a comment block at the top of the file. FAIL: No history exists. Without version control, diagnosing traffic drops caused by directive changes is guesswork.
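The pass/fail criteria for items 1 through 4 are mechanical enough to script. The following is a minimal Python sketch, not a complete parser: the function names (parse_robots, audit_static) and the example.com URL are illustrative assumptions, and the item 4 check is deliberately coarse (it passes if any non-empty Allow accompanies the blanket Disallow, without judging whether that Allow covers the right paths).

```python
MAX_BYTES = 500 * 1024  # Google's documented parsing limit (500 KiB)


def parse_robots(text):
    """Return (groups, sitemaps); each group is (agents, rules)."""
    groups, sitemaps = [], []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            sitemaps.append(value)
        elif field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups, sitemaps


def audit_static(status, body):
    """Check items 1-4 from an HTTP status code and the raw response bytes."""
    text = body.decode("utf-8", errors="replace")
    groups, sitemaps = parse_robots(text)
    wildcard_rules = [r for agents, rules in groups if "*" in agents for r in rules]
    blanket = ("disallow", "/") in wildcard_rules
    has_allow = any(kind == "allow" and value for kind, value in wildcard_rules)
    return {
        "1_reachable": status == 200,
        "2_under_500_kib": len(body) <= MAX_BYTES,
        "3_sitemap_declared": any(
            s.startswith(("http://", "https://")) for s in sitemaps
        ),
        "4_no_blanket_disallow": not blanket or has_allow,
    }


# Hypothetical production file used only for illustration.
sample = (
    b"User-agent: *\nDisallow: /cart\nDisallow: /checkout\n\n"
    b"Sitemap: https://example.com/sitemap.xml\n"
)
for item, passed in audit_static(200, sample).items():
    print(item, "PASS" if passed else "FAIL")
```

The status code and body are passed in rather than fetched, so the same function can run against a live response, a saved snapshot, or a file in the repository.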

Platform-Specific Patterns to Watch

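Shopify serves a default robots.txt, and stores can customize it by adding a robots.txt.liquid template to the theme. The default blocks /cart, /checkout, /orders, and /account, and current defaults also disallow /search and some sorted or multi-tag collection URLs. Coverage of filter URLs is not guaranteed to match a store's theme or apps, so verify against the live file. Stores that rely heavily on faceted navigation (filtering by size, color, or price) need to evaluate whether those filter URLs should be explicitly disallowed or managed through canonical tags.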

Magento exposes robots.txt settings in its admin panel, and WooCommerce stores typically edit the file through a WordPress SEO plugin, which means a non-technical admin can overwrite directives during a routine settings change. On these platforms, confirm that access to the robots.txt editor is role-restricted and that the version-control check (item 12) is enforced through a deployment pipeline rather than manual file edits.

How to Test Each Checklist Item

For reachability and syntax (items 1–4), fetch the raw file in a browser and review it in Google Search Console's robots.txt report, which replaced the retired robots.txt Tester. For path-level allow/disallow checks (items 5–11), use the URL Inspection tool in Search Console to test individual URLs against the live robots.txt, or run a site crawl with Screaming Frog and filter by 'Blocked by robots.txt' to surface unexpected disallows.
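Path-level checks can also be approximated offline before a crawl. The sketch below assumes Google-style matching: '*' matches any run of characters, a trailing '$' anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. The function names, the rule list, and the sample paths are hypothetical; it evaluates the rules of a single user-agent group and is not a substitute for testing against the live file.

```python
import re


def _to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", pattern) for one user-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _to_regex(pattern).match(path):
            is_allow = kind == "allow"
            length = len(pattern)
            if length > best_len or (length == best_len and is_allow):
                best_len, allowed = length, is_allow
    return allowed


# Hypothetical rules and URL expectations for a store.
rules = [
    ("disallow", "/cart"),
    ("disallow", "/checkout"),
    ("disallow", "/search"),
    ("disallow", "/*?*color="),
]
must_block = [
    "/cart",
    "/checkout/step-1",
    "/search?q=boots",
    "/collections/boots?size=9&color=red",
]
must_allow = ["/collections/boots", "/products/leather-boot"]

failures = [p for p in must_block if is_allowed(rules, p)]
failures += [p for p in must_allow if not is_allowed(rules, p)]
print(failures)  # → []
```

Because the longest pattern wins, adding a broad rule such as 'Allow: /collections/' would outrank the shorter filter pattern and reopen the filter URLs. Keeping must_block and must_allow lists alongside the file makes that kind of regression visible.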

For item 12 (version control), check the repository commit history or, if no version control exists, compare a saved snapshot of the file against the current live version using a diff tool. Crawl log analysis, available through server logs or log analysis tools, provides the deepest confirmation that crawlers actually respect the directives in practice, not just in theory.
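Where no repository exists, the snapshot comparison can be done with Python's standard difflib module. This is a small sketch under the assumption that both versions are already available as strings; the function name robots_diff and the sample contents are illustrative.

```python
import difflib


def robots_diff(snapshot, live):
    """Return unified-diff lines between a saved snapshot and the live file."""
    return list(
        difflib.unified_diff(
            snapshot.splitlines(),
            live.splitlines(),
            fromfile="robots.txt (snapshot)",
            tofile="robots.txt (live)",
            lineterm="",
        )
    )


# Hypothetical contents: the live file has picked up a blanket disallow.
snapshot = "User-agent: *\nDisallow: /cart\n"
live = "User-agent: *\nDisallow: /\n"

for line in robots_diff(snapshot, live):
    print(line)
```

An empty result means the live file still matches the snapshot; any output is a change that should map to a documented reason.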

Actionable Next Step After the Audit

After completing the audit, prioritize fails in this order: (1) blanket disallow or blocked revenue paths, because these have direct indexation impact; (2) cart, checkout, and search paths that are open, because these waste crawl budget immediately; (3) version control setup, because it prevents future untracked changes from causing the same issues.

Document every change made to robots.txt with the date, the directive added or removed, and the business reason. Submit the updated sitemap URL in Google Search Console after any directive change, and monitor the Page indexing (formerly Coverage) report over the following two to four weeks to confirm that previously blocked pages are re-crawled and that no new excluded entries appear for pages that should be indexed.

Frequently asked questions

How often should an ecommerce store audit its robots.txt file?

Run a full audit quarterly and immediately after any platform migration, theme update, or addition of a new app that generates URL patterns. Robots.txt errors are silent (crawlers do not send alerts when directives change), so scheduled audits are the only reliable way to catch accidental blocks before they affect organic traffic and revenue.

Does blocking a URL in robots.txt remove it from Google's index?

No. Disallowing a URL stops Googlebot from crawling it, but if external links point to that URL, Google can still index it as a URL-only listing with no content snippet. To remove a URL from the index entirely, use a noindex meta tag on the page itself and leave the URL crawlable, because Googlebot must fetch the page to see the tag. A removal request through Google Search Console hides the URL only temporarily. Robots.txt and noindex serve different functions: one controls crawling, the other controls indexing.
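A common side effect of mixing the two mechanisms is a noindex page that is also disallowed, so the tag is never seen. The sketch below flags that conflict using Python's standard urllib.robotparser. It assumes the list of noindex URLs comes from elsewhere (a crawl export, for example), the function name and example.com URLs are hypothetical, and the standard-library parser uses simple prefix matching rather than Google's wildcard semantics.

```python
from urllib.robotparser import RobotFileParser


def noindex_conflicts(robots_txt, noindex_urls, agent="Googlebot"):
    """Return noindex URLs the given agent is not allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]


# Hypothetical inputs for illustration.
robots_txt = "User-agent: *\nDisallow: /account\n"
noindex_urls = [
    "https://example.com/account/orders",
    "https://example.com/old-landing-page",
]
print(noindex_conflicts(robots_txt, noindex_urls))
```

Each URL returned needs a decision: either lift the disallow so the noindex tag can be read, or drop the tag and rely on the crawl block alone.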

What is the biggest robots.txt mistake ecommerce stores make?

The most damaging mistake is an accidental 'Disallow: /' applied to all user-agents, which blocks the entire store from crawling. This frequently occurs when a staging-environment robots.txt is copied to production during a site migration. The second most damaging mistake is blocking faceted navigation URLs that carry unique, rankable long-tail keyword intent.

Should ecommerce stores block Googlebot from crawling out-of-stock product pages?

No. Blocking out-of-stock pages in robots.txt stops Googlebot from recrawling them, so Google cannot see when stock returns and the existing listing can go stale or lose its snippet. The correct approach is to keep these pages crawlable and indexed, use structured data to signal availability, and implement internal linking that deprioritizes permanently discontinued products. Robots.txt blocks are too blunt for dynamic inventory management.
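As an illustration of signaling availability with structured data, the following Python sketch builds schema.org Product markup as JSON-LD. The function name, product name, URL, and price are made-up examples, and a real page would usually carry more properties than shown here.

```python
import json


def product_jsonld(name, url, price, currency, in_stock):
    """Build minimal schema.org Product JSON-LD with an availability flag."""
    availability = (
        "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    )
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "url": url,
            "price": price,
            "priceCurrency": currency,
            "availability": availability,
        },
    }
    return json.dumps(data, indent=2)


# Hypothetical out-of-stock product.
print(
    product_jsonld(
        "Leather Boot",
        "https://example.com/products/leather-boot",
        "129.00",
        "USD",
        in_stock=False,
    )
)
```

The output would be embedded in the product page inside a script tag of type application/ld+json, with the availability value updated whenever stock changes.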

Can a robots.txt file slow down a site or affect page speed?

The robots.txt file itself has no effect on page speed for end users. However, Google parses only the first 500 KiB of an oversized robots.txt, effectively ignoring all directives after the cutoff point. Keep the file lean by consolidating overlapping rules and removing directives for URL patterns that no longer exist on the site.

Written by

Matt is the founder of RunOctopus. He built All Angles Creatures from zero to page-1 rankings in reptile feeder insects in under 60 days using exactly this method, turning a hard, entrenched niche into RunOctopus's proof store for programmatic SEO and AI search citation.
