How to Use This Crawl Budget Checklist
Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given timeframe. For ecommerce stores with thousands of product, category, filter, and pagination URLs, wasted crawl budget means important pages go unindexed while duplicate or low-value pages consume Googlebot's time instead.
Work through each of the 12 checks below in order. Mark each as Pass or Fail based on the stated criteria. Any Fail is a direct action item. Stores with more than 10,000 URLs should treat Fails in the first six checks as high-priority fixes before moving to the remaining items.
Checks 1–4: Crawl Waste from URL Proliferation
**Check 1 – Faceted navigation produces canonical URLs.** Pass: Every filtered URL (e.g., /shoes?color=red&size=10) either carries a canonical tag pointing to the base category URL or is blocked in robots.txt. Fail: Filter combinations generate unique, indexable URLs with no canonical or robots directive, multiplying crawlable pages by hundreds or thousands.
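To get a rough sense of how quickly filter combinations multiply, the number of distinct filtered URLs for one category can be estimated from the number of values per facet. The sketch below assumes each facet is optional and single-select and that parameters always appear in one fixed order; the facet names and counts are hypothetical.

```python
from math import prod

def count_filtered_urls(facet_value_counts: dict[str, int]) -> int:
    """Estimate distinct filtered URLs for one category page.

    Assumes each facet is optional and single-select, and that parameters
    always appear in one fixed order. Each facet therefore contributes
    (values + 1) choices: one per value, plus "not applied".
    """
    combinations = prod(n + 1 for n in facet_value_counts.values())
    return combinations - 1  # exclude the unfiltered base category URL

# Hypothetical facets for a /shoes category.
facets = {"color": 12, "size": 15, "brand": 30}
print(count_filtered_urls(facets))  # 13 * 16 * 31 - 1 = 6447
```

Multi-select facets or unordered parameters would push the count considerably higher, so this figure is best read as a lower bound.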
**Check 2 – Internal search result pages are blocked.** Pass: All /search?q= and similar query-string URLs carry a noindex meta tag or are disallowed in robots.txt. Fail: Site search results are crawlable and appear in Google Search Console's Page indexing report (formerly Coverage) as indexed or discovered pages.
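**Check 3 – Pagination uses crawlable links with correct canonicals.** Pass: Paginated category pages (/category?page=2) are reachable through plain anchor links, including on load-more or infinite-scroll implementations, and carry a self-referencing canonical if each page deserves independent indexing. Google stated in 2019 that it no longer uses rel=next/prev as an indexing signal, so that markup is optional and does not count toward a Pass on its own. Fail: Paginated pages have no canonical and no robots directive, leaving Googlebot to crawl every page variant with no signal about which URL to index.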
**Check 4 – Out-of-stock and discontinued product pages are handled.** Pass: Permanently discontinued products return a 301 redirect to a relevant category or replacement product. Temporarily out-of-stock products remain indexable with structured availability markup. Fail: Discontinued products return 200 OK with thin or empty content, consuming crawl budget with no ranking value.
Checks 5–7: Technical Signals That Affect Crawl Rate
**Check 5 – Server response times are under 200 ms for crawled URLs.** Pass: The average response time in Search Console's Crawl Stats report is consistently below 200 ms. Fail: Average response time exceeds 500 ms or spikes regularly, which causes Googlebot to slow its crawl rate automatically to avoid overloading the server.
**Check 6 – The robots.txt file disallows crawl-waste directories.** Pass: The robots.txt explicitly disallows directories like /cart/, /checkout/, /account/, /wishlist/, and any parameter-based duplicate paths. Fail: robots.txt is empty or only contains a sitemap reference, leaving all internal utility pages open to Googlebot.
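One way to verify Check 6 is to test sample paths against the robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the sample paths are assumptions for illustration. Note that this parser does plain prefix matching and does not implement Google's `*` and `$` wildcard extensions, so it is only suitable for simple directory rules like these.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the utility directories named in Check 6.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
"""

def blocked_paths(robots_txt: str, paths: list[str],
                  agent: str = "Googlebot") -> dict[str, bool]:
    """Map each path to True if the rules disallow it for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: not parser.can_fetch(agent, path) for path in paths}

for path, blocked in blocked_paths(
    ROBOTS_TXT, ["/cart/", "/checkout/payment", "/shoes/red-sneaker"]
).items():
    print(f"{path}: {'blocked' if blocked else 'crawlable'}")
```

Utility paths should come back blocked, while product and category paths should stay crawlable.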
**Check 7 – 5xx error rate in Crawl Stats is below 1%.** Pass: The "By response" breakdown in Google Search Console's Crawl Stats report shows fewer than 1% of crawl requests returning server errors (5xx) over any 30-day window. Fail: 5xx responses reach 1% or more of crawl requests. Repeated 5xx errors signal server instability to Googlebot, causing it to reduce crawl frequency across the entire domain.
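The same 1% threshold can be checked against your own server logs. The sketch below assumes you have already extracted the HTTP status codes of Googlebot requests for a 30-day window; the function names and sample data are hypothetical.

```python
def server_error_rate(status_codes: list[int]) -> float:
    """Share of requests that returned a 5xx status (0.0 for an empty log)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)

def check_7_passes(status_codes: list[int], threshold: float = 0.01) -> bool:
    """Pass when the 5xx share is strictly below the threshold (1% by default)."""
    return server_error_rate(status_codes) < threshold

# Hypothetical 30-day sample: 990 OK, 6 server errors, 4 not found.
sample = [200] * 990 + [503] * 6 + [404] * 4
print(f"{server_error_rate(sample):.1%}", check_7_passes(sample))
```

Only 5xx responses count toward this check; 404s are a separate concern and are deliberately excluded from the error rate here.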
Checks 8–10: Sitemap and Internal Linking Quality
**Check 8 – XML sitemap contains only indexable, canonical URLs.** Pass: Every URL in the sitemap returns a 200 status code, is not noindexed, and matches the canonical version of that URL. Fail: The sitemap includes redirected URLs, noindexed pages, or URLs with parameters that differ from their canonical counterparts. Each of these is a direct signal to Googlebot that the sitemap is unreliable.
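The three Pass conditions for Check 8 can be expressed as a small validation routine. The sketch below assumes you have already crawled each sitemap URL with your own tool and recorded its status, noindex state, and canonical; the `PageFacts` structure and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

@dataclass
class PageFacts:
    """What a crawl recorded for one URL (hypothetical structure)."""
    status: int
    noindex: bool
    canonical: str

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]

def sitemap_problems(xml_text: str, facts: dict[str, PageFacts]) -> dict[str, str]:
    """Return the first failed Check 8 condition for each problematic URL."""
    problems = {}
    for url in sitemap_urls(xml_text):
        page = facts.get(url)
        if page is None:
            problems[url] = "not crawled"
        elif page.status != 200:
            problems[url] = f"status {page.status}"
        elif page.noindex:
            problems[url] = "noindex"
        elif page.canonical != url:
            problems[url] = f"canonical points to {page.canonical}"
    return problems

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes</loc></url>
  <url><loc>https://example.com/old-boots</loc></url>
  <url><loc>https://example.com/shoes?sort=price</loc></url>
</urlset>"""

FACTS = {
    "https://example.com/shoes": PageFacts(200, False, "https://example.com/shoes"),
    "https://example.com/old-boots": PageFacts(301, False, ""),
    "https://example.com/shoes?sort=price": PageFacts(200, False, "https://example.com/shoes"),
}

for url, reason in sitemap_problems(SAMPLE, FACTS).items():
    print(url, "->", reason)
```

An empty result corresponds to a Pass; any entry is a URL to remove from the sitemap or fix at the source.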
**Check 9 – Sitemap is segmented by content type for stores over 10k URLs.** Pass: Separate sitemaps exist for products, categories, and blog content, each submitted individually in Search Console. Fail: A single flat sitemap contains all URL types mixed together, making it impossible to diagnose crawl patterns by content type in Search Console.
**Check 10 – Internal links do not point to redirected or noindexed URLs.** Pass: A crawl using a tool such as Screaming Frog shows zero internal links pointing to URLs that redirect (301) or carry a noindex directive. Fail: Navigation menus, breadcrumbs, or footer links reference old URLs that redirect, forcing Googlebot to follow redirect chains on every crawl.
Checks 11–12: Duplicate Content and Parameter Handling
**Check 11 – Duplicate-generating URL parameters are handled via canonical tags or robots.txt.** Pass: Parameters that generate duplicate content (sorting, tracking, currency switching) carry canonical tags pointing to the parameter-free URL or are disallowed in robots.txt. Google retired Search Console's URL Parameters tool in 2022, so parameter handling can no longer be configured there. Fail: Tracking parameters like ?utm_source= or ?ref= create crawlable duplicate URLs that appear as separate pages in the Page indexing report (formerly Coverage).
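The canonical target for a parameterized URL can be derived by stripping the parameters that do not change page content. The sketch below is one possible approach using Python's standard `urllib.parse`; the list of tracking parameters is an assumption and should be adapted to the parameters your store actually emits.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend with sorting or currency
# parameters if those also produce duplicate content on your store.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"ref", "gclid", "fbclid"}

def canonical_url(url: str) -> str:
    """Drop tracking parameters and the fragment, keep everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/shoes?utm_source=news&ref=home"))
print(canonical_url("https://example.com/shoes?page=2&utm_medium=email"))
```

Comparing this computed value with the canonical tag each page actually serves is a quick way to spot parameter URLs that fail the check.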
**Check 12 – Hreflang pages for international stores do not duplicate crawlable URLs without canonical clusters.** Pass: Each hreflang alternate URL carries a self-referencing canonical, and its hreflang annotations point to the correct regional variants, which link back in return. Fail: International subdirectories or subdomains serve near-identical content without proper canonical and hreflang pairing, doubling or tripling the crawlable URL space without corresponding indexing benefit.
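Hreflang annotations only work when they are reciprocal, so a missing return link is a common reason for this check to fail. The sketch below assumes you have collected each page's hreflang annotations into a dictionary; the data structure and the example URLs are hypothetical.

```python
def missing_return_links(
    hreflang_map: dict[str, dict[str, str]]
) -> list[tuple[str, str]]:
    """Return (source, target) pairs where target does not link back to source.

    hreflang_map maps each page URL to its {language code: alternate URL}
    annotations, as collected by a crawl.
    """
    missing = []
    for source, alternates in hreflang_map.items():
        for target in alternates.values():
            if target == source:
                continue  # self-reference needs no return link
            back_links = hreflang_map.get(target, {}).values()
            if source not in back_links:
                missing.append((source, target))
    return missing

US = "https://example.com/us/shoes"
UK = "https://example.com/uk/shoes"
DE = "https://example.com/de/schuhe"

pages = {
    US: {"en-us": US, "en-gb": UK, "de-de": DE},
    UK: {"en-gb": UK, "en-us": US, "de-de": DE},
    DE: {"de-de": DE, "en-us": US},  # no annotation pointing back to UK
}
print(missing_return_links(pages))
```

In this sample the German page never references the UK page, so the UK page's annotation toward it is unreciprocated and gets reported.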
Prioritizing Fixes After the Audit
Score your audit by category. Checks 1–4 address URL proliferation, typically the largest source of crawl waste on ecommerce sites. Fix these before all others. A faceted navigation producing 50,000 parameter URLs can consume more crawl budget than every other issue on this list combined.
Checks 5–7 require infrastructure or configuration changes that affect crawl rate directly. These take longer to implement but produce measurable improvement in Crawl Stats within 30 to 60 days of deployment. Checks 8–12 are maintenance items that compound over time; schedule a quarterly review to prevent regression as catalogs grow and promotions add temporary URL sets.
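The prioritization above can be turned into an ordered fix list once each check has a Pass or Fail result. The sketch below is a minimal illustration; the group labels and the results dictionary are assumptions, and checks with no recorded result are treated as Fail so that nothing unaudited is silently skipped.

```python
# Priority groups in the order described above (labels are illustrative).
PRIORITY_GROUPS = [
    ("URL proliferation", range(1, 5)),
    ("Crawl rate signals", range(5, 8)),
    ("Maintenance", range(8, 13)),
]

def fix_order(results: dict[int, bool]) -> list[tuple[str, list[int]]]:
    """Group failed checks by priority.

    results maps a check number to True (Pass) or False (Fail).
    Checks missing from results are treated as Fail.
    """
    ordered = []
    for name, checks in PRIORITY_GROUPS:
        failed = [check for check in checks if not results.get(check, False)]
        if failed:
            ordered.append((name, failed))
    return ordered

# Hypothetical audit: everything passes except checks 2, 6, and 11.
audit = {number: True for number in range(1, 13)}
audit.update({2: False, 6: False, 11: False})

for group, failed_checks in fix_order(audit):
    print(f"{group}: fix checks {failed_checks}")
```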