Why Ecommerce Stores Need a robots.txt Audit
A robots.txt file tells crawlers which URLs to request and which to skip. On an ecommerce store with thousands of product, category, and faceted-filter URLs, a single misconfigured directive can block revenue-generating pages from Google's index or waste crawl budget on URLs that should never be indexed.
The 12-item checklist below gives each check a concrete pass/fail criterion. Run it quarterly, after any platform migration, and after adding a new faceted navigation or app that generates new URL patterns. Treat any 'fail' as a crawl-budget or indexation risk that requires immediate remediation.
The 12-Item robots.txt Audit Checklist
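1. File is reachable at the root domain. PASS: HTTP 200 returned at https://yourdomain.com/robots.txt. FAIL: 404, 500, or redirect to another URL. Crawlers generally treat a 404 as 'no restrictions', which exposes unintended URLs, while Google initially treats a 5xx response as if the whole site were disallowed. Google follows a limited number of redirect hops, but a redirect makes it harder to confirm which rules are actually in force, so serve the file directly.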
2. File size is under 500 KiB. PASS: File is under 500 KiB (512,000 bytes). FAIL: File exceeds 500 KiB. Google ignores content after the first 500 KiB, so directives beyond that point are silently dropped. Consolidate redundant rules to stay well under the limit.
3. Sitemap URL is declared. PASS: At least one 'Sitemap:' directive points to a valid XML sitemap URL. FAIL: No Sitemap: line exists. Declaring the sitemap in robots.txt ensures every crawler that reads the file can discover the full URL inventory without relying solely on Search Console submission.
4. No blanket Disallow for all bots. PASS: The wildcard user-agent (*) block does not contain 'Disallow: /'. FAIL: 'User-agent: * / Disallow: /' is present without an accompanying Allow that overrides it for key paths. This pattern, often left over from dev or staging configs, blocks crawling of the entire site.
5. Cart and checkout paths are blocked. PASS: Directives disallow /cart, /checkout, and /order-confirmation (or platform equivalents). FAIL: These paths are crawlable. Cart and checkout pages provide zero SEO value, waste crawl budget, and can expose session-specific data in search results.
6. Account and login paths are blocked. PASS: /account, /login, /register, /wishlist, and similar authenticated-only paths are disallowed. FAIL: Any of these paths is crawlable. Account pages are thin, near-duplicate content and make poor landing pages when surfaced in search.
7. Faceted navigation parameters are managed. PASS: URL patterns that produce near-duplicate pages, such as filter combinations like ?color=&size=, are either disallowed via robots.txt path patterns or handled via canonicals with crawl-budget protection confirmed in Google Search Console coverage reports. FAIL: Filter URLs are freely crawlable with no duplicate-content mitigation in place.
8. Internal search result pages are blocked. PASS: /search, /search-results, or the platform's search query path (e.g., Shopify's /search?q=) is disallowed. FAIL: Search result pages are crawlable. These pages are dynamic, near-infinite in combination, and likely to be treated as duplicate or thin content.
9. Staging or dev subdomains are not governed by production robots.txt. PASS: Staging or dev environments have their own robots.txt with 'Disallow: /' and are also protected by HTTP authentication or IP restriction. FAIL: The production robots.txt references or is shared with dev/staging environments. Staging content indexed accidentally can create duplicate-content issues against live pages.
10. No critical product or category paths are accidentally disallowed. PASS: A crawl simulation (using Google Search Console's URL Inspection or a crawler tool) confirms all top-revenue category and product URL patterns return as 'Allowed'. FAIL: Any revenue-driving URL pattern matches an existing Disallow directive. This is the most financially damaging error type and requires immediate correction.
11. Googlebot-specific and other named bot directives are intentional. PASS: Every named user-agent block (Googlebot, Bingbot, AhrefsBot, etc.) has a documented reason and the rules are consistent with business intent. FAIL: Named bot blocks exist with no clear rationale, or rules contradict the wildcard block in unintended ways. A crawler obeys only the most specific group that matches its user-agent and ignores the wildcard group, so any wildcard rule that is not repeated in a named block does not apply to that bot.
12. robots.txt is version-controlled and change-logged. PASS: The file is stored in a version control system (Git or similar) with a commit history, or a changelog is maintained in a comment block at the top of the file. FAIL: No history exists. Without version control, diagnosing traffic drops caused by directive changes is guesswork.
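Some of the file-level checks above (items 2, 3, and 4) can be scripted. The following is a minimal Python sketch, not a full robots.txt parser: the function names `parse_groups` and `audit`, the result keys, and the sample file are illustrative assumptions. It reads the file as text, so fetching it (item 1) is left to whatever HTTP client you already use.

```python
MAX_BYTES = 500 * 1024  # the 500 KiB limit from item 2


def parse_groups(text):
    """Split robots.txt text into (user_agents, rules) groups plus sitemap URLs."""
    groups, sitemaps = [], []
    agents, rules, last_was_agent = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            sitemaps.append(value)
        elif field == "user-agent":
            # A User-agent line that follows rules starts a new group.
            if agents and not last_was_agent:
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
            last_was_agent = True
        elif field in ("allow", "disallow"):
            if agents:
                rules.append((field, value))
            last_was_agent = False
    if agents:
        groups.append((agents, rules))
    return groups, sitemaps


def audit(text):
    """Return pass/fail booleans for checklist items 2, 3, and 4."""
    groups, sitemaps = parse_groups(text)
    wildcard_rules = [r for agents, rules in groups if "*" in agents for r in rules]
    has_allow = any(kind == "allow" and path for kind, path in wildcard_rules)
    blanket = ("disallow", "/") in wildcard_rules and not has_allow
    return {
        "size_ok": len(text.encode("utf-8")) <= MAX_BYTES,
        "sitemap_declared": any(
            url.startswith(("http://", "https://")) for url in sitemaps
        ),
        "no_blanket_disallow": not blanket,
    }


# Hypothetical file content used only for illustration.
sample = """\
User-agent: *
Disallow: /cart
Disallow: /checkout
Sitemap: https://yourdomain.com/sitemap.xml
"""

for check, passed in audit(sample).items():
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

The blanket-disallow check here only looks for any Allow rule in the wildcard group; whether that Allow actually covers your key paths still needs the path-level testing described later.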
Platform-Specific Patterns to Watch
Shopify serves a default robots.txt that can be customized through the robots.txt.liquid theme template. The defaults block /cart, /checkout, /orders, and /account; recent default files have also disallowed /search and some collection sort and multi-filter URL patterns, but the defaults change over time, so check the live file rather than assuming. Stores that rely heavily on faceted navigation (filtering by size, color, or price) need to evaluate whether the remaining filter URLs should be explicitly disallowed or managed through canonical tags.
Magento and WooCommerce installations commonly expose robots.txt through the admin panel or an SEO plugin, which means a non-technical admin can overwrite directives during a routine settings change. On these platforms, confirm that access to the robots.txt editor is role-restricted and that the version-control check (item 12) is enforced through a deployment pipeline rather than manual file edits.
How to Test Each Checklist Item
For reachability and syntax (items 1 to 4), fetch the raw file in a browser and review it in Google Search Console's robots.txt report, which replaced the retired standalone robots.txt Tester. For path-level allow/disallow checks (items 5 to 11), use the URL Inspection tool in Search Console to test individual URLs against the live robots.txt, or run a site crawl with Screaming Frog and filter by 'Blocked by robots.txt' to surface unexpected disallows.
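For a quick local pre-check of path-level rules before turning to Search Console, Python's standard-library `urllib.robotparser` can evaluate paths against a copy of the file. This is a rough sketch under stated assumptions: the helper name `check_paths` and the sample rules and paths are hypothetical, and `urllib.robotparser` applies rules in file order without Google's `*` and `$` wildcard or longest-match handling, so its verdicts can differ from Googlebot's on files that rely on those features.

```python
from urllib.robotparser import RobotFileParser


def check_paths(robots_text, must_block, must_allow, agent="Googlebot"):
    """List paths whose crawlability contradicts the audit expectations."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    problems = []
    for path in must_block:
        if parser.can_fetch(agent, path):
            problems.append(f"should be blocked but is crawlable: {path}")
    for path in must_allow:
        if not parser.can_fetch(agent, path):
            problems.append(f"should be crawlable but is blocked: {path}")
    return problems


# Hypothetical rules and paths; substitute your platform's equivalents.
robots_text = """\
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /search
Disallow: /collections/sale
"""

problems = check_paths(
    robots_text,
    must_block=["/cart", "/checkout", "/search?q=shoes"],
    must_allow=["/products/blue-widget", "/collections/sale"],
)
for problem in problems:
    print(problem)
```

In this sample the only finding is the accidentally disallowed /collections/sale path, the item 10 failure mode. Treat a clean result as a reason to proceed to URL Inspection, not as a replacement for it.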
For item 12 (version control), check the repository commit history or, if no version control exists, compare a saved snapshot of the file against the current live version using a diff tool. Crawl log analysis, available through server logs or log analysis tools, provides the deepest confirmation that crawlers are actually respecting the directives in practice, not just in theory.
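Where no repository exists yet, the snapshot comparison can be done with Python's standard-library `difflib`. The sketch below is illustrative only: the function name `diff_robots`, the file labels, and the sample contents are assumptions, and loading the snapshot and the live file is left to you.

```python
import difflib


def diff_robots(snapshot_text, live_text):
    """Return only the added (+) and removed (-) lines between two versions."""
    diff = difflib.unified_diff(
        snapshot_text.splitlines(),
        live_text.splitlines(),
        fromfile="snapshot/robots.txt",
        tofile="live/robots.txt",
        lineterm="",
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical contents: a new Disallow appeared since the snapshot was saved.
snapshot = "User-agent: *\nDisallow: /cart\n"
live = "User-agent: *\nDisallow: /cart\nDisallow: /collections\n"

for change in diff_robots(snapshot, live):
    print(change)
```

Each reported line is a candidate entry for the change log: an untracked `+Disallow:` on a revenue path is exactly the kind of change that item 12 is meant to catch.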
Actionable Next Step After the Audit
After completing the audit, prioritize fails in this order: (1) blanket disallow or blocked revenue paths, because these have direct indexation impact; (2) cart, checkout, and search paths that are open, because these waste crawl budget immediately; (3) version control setup, because it prevents future untracked changes from causing the same issues.
Document every change made to robots.txt with the date, the directive added or removed, and the business reason. Resubmit the sitemap in Google Search Console after any directive change, and monitor the Page indexing report (formerly Coverage) over the following two to four weeks to confirm that previously blocked pages are re-crawled and that no new 'Blocked by robots.txt' or other not-indexed entries appear for pages that should be indexed.