Crawl Budget Checklist: 12 Items Every Ecommerce Store Should Audit

How to Use This Crawl Budget Checklist

Crawl budget is the number of URLs Googlebot crawls on your site within a given timeframe. For ecommerce stores with thousands of product, category, filter, and pagination URLs, wasted crawl budget means important pages go unindexed while duplicate or low-value pages consume Googlebot's time instead.

Work through each of the 12 checks below in order. Mark each as Pass or Fail based on the stated criteria. Any Fail is a direct action item. Stores with more than 10,000 URLs should treat Fails in the first six checks as high-priority fixes before moving to the remaining items.

Checks 1–4: Crawl Waste from URL Proliferation

**Check 1: Faceted navigation produces canonical URLs.** Pass: Every filtered URL (e.g., /shoes?color=red&size=10) either carries a canonical tag pointing to the base category URL or is blocked in robots.txt. Fail: Filter combinations generate unique, indexable URLs with no canonical or robots directive, multiplying crawlable pages by hundreds or thousands.
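To see how quickly filter combinations multiply, here is a minimal sketch. It assumes each facet is either unset or set to exactly one value and that parameters always appear in the same order; the facet sizes are hypothetical. Multi-select filters or order-dependent URLs would push the count higher.

```python
from math import prod

def facet_url_count(options_per_facet):
    """Count filtered URL variants for one category page.

    Assumes each facet is either unset or set to a single value,
    and that parameters always appear in the same order.
    """
    # (n + 1) choices per facet: n values plus "not applied".
    # Subtract 1 to exclude the unfiltered base category URL.
    return prod(n + 1 for n in options_per_facet) - 1

# Hypothetical category: 8 colors, 12 sizes, 5 brands, 4 price bands.
print(facet_url_count([8, 12, 5, 4]))  # 3509
```

Under these assumptions a single category with four modest filters already yields several thousand crawlable variants.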

**Check 2: Internal search result pages are blocked.** Pass: All /search?q= and similar query-string URLs return a noindex meta tag or are disallowed in robots.txt. Fail: Site search results are crawlable and appear in Google Search Console's Page indexing report (formerly Coverage) as indexed or discovered pages.

**Check 3: Paginated pages are crawlable and correctly canonicalized.** Pass: Paginated category pages (/category?page=2) are reachable through plain links and carry a self-referencing canonical if each page deserves independent indexing; a load-more pattern exposes the same pages at crawlable URLs. Google has said since 2019 that it no longer uses rel=next/prev as an indexing signal, so those attributes are optional and do not count toward a Pass on their own. Fail: Paginated pages have no canonical and no robots directive, forcing Googlebot to crawl every page variant.

**Check 4: Out-of-stock and discontinued product pages are handled.** Pass: Permanently discontinued products return a 301 redirect to a relevant category or replacement product. Temporarily out-of-stock products remain indexable with structured availability markup. Fail: Discontinued products return 200 OK with thin or empty content, consuming crawl budget with no ranking value.
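The "structured availability markup" for a temporarily out-of-stock product could look like the sketch below, which builds schema.org Product JSON-LD. The function name and the sample product are hypothetical; the `availability` values are the schema.org `InStock` and `OutOfStock` terms.

```python
import json

def product_jsonld(name, price, currency, in_stock):
    """Build schema.org Product markup with an availability value."""
    availability = "InStock" if in_stock else "OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }

# Hypothetical product that is temporarily out of stock.
markup = product_jsonld("Trail Runner", 89.0, "USD", in_stock=False)
print(json.dumps(markup, indent=2))
```

The printed JSON would be embedded in a `<script type="application/ld+json">` element on the product page, which keeps returning 200 while the item is unavailable.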

Checks 5–7: Technical Signals That Affect Crawl Rate

**Check 5: Server response times are under 200ms for crawled URLs.** Pass: Googlebot's average response time in Search Console's Crawl Stats report is consistently below 200ms. Fail: Average response time exceeds 500ms or spikes regularly, which causes Googlebot to slow its crawl rate automatically to avoid overloading the server. Treat values between 200ms and 500ms as borderline and watch the trend.

**Check 6: The robots.txt file disallows crawl-waste directories.** Pass: The robots.txt explicitly disallows directories like /cart/, /checkout/, /account/, /wishlist/, and any parameter-based duplicate paths. Fail: robots.txt is empty or only contains a sitemap reference, leaving all internal utility pages open to Googlebot.
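One way to test a robots.txt against a list of sample URLs is Python's standard `urllib.robotparser`. This is a sketch only: the robots.txt content, the `shop.example` domain, and the `crawlable` helper are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a store; the paths are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /search
Sitemap: https://shop.example/sitemap.xml
"""

def crawlable(robots_txt, urls, agent="Googlebot"):
    """Return the subset of urls that the given agent may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]

urls = [
    "https://shop.example/cart/",
    "https://shop.example/checkout/payment",
    "https://shop.example/search?q=boots",
    "https://shop.example/shoes/trail-runner",
]
print(crawlable(ROBOTS_TXT, urls))
```

Only the product URL should remain in the output. Note that the standard-library parser does plain prefix matching, so rules that rely on `*` or `$` wildcards need a parser that implements Google's matching rules.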

**Check 7: 5xx error rate in Crawl Stats is below 1%.** Pass: The Server Errors section of Google Search Console's Crawl Stats shows fewer than 1% of crawl requests returning 5xx responses over any 30-day window. Fail: Repeated 5xx errors signal server instability to Googlebot, causing it to reduce crawl frequency across the entire domain.
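If you prefer to verify Checks 5 and 7 from your own server logs, the thresholds can be applied as in this sketch. It assumes the log has already been filtered to Googlebot requests and reduced to (status code, response time in ms) pairs; the sample data is made up.

```python
def crawl_health(requests, max_avg_ms=200, max_5xx_rate=0.01):
    """Summarize crawl requests given as (status_code, response_ms) pairs."""
    if not requests:
        raise ValueError("no crawl requests to evaluate")
    total = len(requests)
    avg_ms = sum(ms for _, ms in requests) / total
    error_rate = sum(1 for status, _ in requests if 500 <= status <= 599) / total
    return {
        "avg_ms": avg_ms,
        "error_rate": error_rate,
        "check_5_pass": avg_ms < max_avg_ms,        # Check 5: under 200ms
        "check_7_pass": error_rate < max_5xx_rate,  # Check 7: under 1% 5xx
    }

# Hypothetical sample: 98 fast 200 responses and 2 slow 503 responses.
sample = [(200, 150)] * 98 + [(503, 900)] * 2
print(crawl_health(sample))
```

In this sample the average response time passes Check 5 at 165ms, while the 2% error rate fails Check 7.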

Checks 8–10: Sitemap and Internal Linking Quality

**Check 8: XML sitemap contains only indexable, canonical URLs.** Pass: Every URL in the sitemap returns a 200 status code, is not noindexed, and matches the canonical version of that URL. Fail: The sitemap includes redirected URLs, noindexed pages, or URLs with parameters that differ from their canonical counterparts, each a direct signal to Googlebot that the sitemap is unreliable.
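The three Pass conditions can be checked mechanically once you have crawl data for the sitemap URLs. The sketch below parses a sitemap with the standard library and compares each entry against a crawl export; the shape of the `crawl` dictionary and the sample URLs are assumptions, so adapt them to whatever your crawler produces.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_problems(sitemap_xml, crawl):
    """Return {url: reason} for sitemap entries that fail Check 8.

    crawl maps each URL to a dict with 'status', 'noindex' and
    'canonical' keys (a hypothetical shape for crawler output).
    """
    problems = {}
    root = ET.fromstring(sitemap_xml)
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        page = crawl.get(url)
        if page is None:
            problems[url] = "not crawled"
        elif page["status"] != 200:
            problems[url] = f"status {page['status']}"
        elif page["noindex"]:
            problems[url] = "noindex"
        elif page["canonical"] != url:
            problems[url] = "canonical mismatch"
    return problems

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/shoes</loc></url>
  <url><loc>https://shop.example/shoes?sort=price</loc></url>
  <url><loc>https://shop.example/old-boots</loc></url>
</urlset>"""

CRAWL = {
    "https://shop.example/shoes":
        {"status": 200, "noindex": False, "canonical": "https://shop.example/shoes"},
    "https://shop.example/shoes?sort=price":
        {"status": 200, "noindex": False, "canonical": "https://shop.example/shoes"},
    "https://shop.example/old-boots":
        {"status": 301, "noindex": False, "canonical": ""},
}
print(sitemap_problems(SITEMAP, CRAWL))
```

Here the sorted URL is flagged for a canonical mismatch and the old product for its 301, so both should be removed from the sitemap.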

**Check 9: Sitemap is segmented by content type for stores over 10k URLs.** Pass: Separate sitemaps exist for products, categories, and blog content, each submitted individually in Search Console. Fail: A single flat sitemap contains all URL types mixed together, making it impossible to diagnose crawl patterns by content type in Search Console.

**Check 10: Internal links do not point to redirected or noindexed URLs.** Pass: A crawl using a tool such as Screaming Frog shows zero internal links pointing to URLs that return a 301 or carry a noindex directive. Fail: Navigation menus, breadcrumbs, or footer links reference old URLs that redirect, forcing Googlebot to follow redirect chains on every crawl.
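Once a crawl has produced a map of redirecting URLs, finding the correct replacement for each stale internal link is a matter of following the chain to its end. A minimal sketch, assuming a hypothetical `{source: destination}` redirect map exported from your crawler:

```python
def final_target(url, redirects, max_hops=10):
    """Follow a {source: destination} redirect map to its final URL.

    Returns (final_url, hops). Raises ValueError on a loop or an
    overly long chain.
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or chain too long at {url}")
        seen.add(url)
    return url, hops

def links_to_fix(internal_links, redirects):
    """Map each redirecting link target to the URL it should point to."""
    return {
        link: final_target(link, redirects)[0]
        for link in internal_links
        if link in redirects
    }

# Hypothetical redirect map with a two-hop chain.
redirects = {"/sale-2022": "/sale-2023", "/sale-2023": "/sale"}
print(links_to_fix(["/sale-2022", "/shoes"], redirects))
```

Updating the link to the final URL removes both hops at once instead of shortening the chain one step at a time.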

Checks 11–12: Duplicate Content and Parameter Handling

**Check 11: Duplicate-generating URL parameters are handled via canonical tags or robots.txt.** Pass: Parameters that generate duplicate content (sorting, tracking, currency switching) carry canonical tags pointing to the parameter-free URL or are disallowed in robots.txt. Google retired Search Console's legacy URL Parameters tool in 2022, so declaring parameters there is no longer an option. Fail: Tracking parameters like ?utm_source= or ?ref= create crawlable duplicate URLs that appear as separate pages in the Page indexing report (formerly Coverage).
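The canonical target for a parameterized URL can be derived by dropping the parameters that do not change page content. The sketch below uses the standard `urllib.parse` module; the `NON_CANONICAL` set and the `utm_` prefix rule are assumptions to adjust for your own store.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters that never change page content.
NON_CANONICAL = {"ref", "sort", "currency"}

def canonical_url(url):
    """Drop tracking and presentation parameters from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CANONICAL and not key.startswith("utm_")
    ]
    # Rebuild the URL without the dropped parameters or any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://shop.example/shoes?utm_source=mail&color=red&ref=abc"))
# https://shop.example/shoes?color=red
```

The same function can feed the canonical tag in page templates and flag any crawled URL whose declared canonical differs from the computed one.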

**Check 12: Hreflang pages for international stores do not duplicate crawlable URLs without canonical clusters.** Pass: Each hreflang alternate URL is self-canonicalized and points to the correct regional variant. Fail: International subdirectories or subdomains serve near-identical content without proper canonical and hreflang pairing, doubling or tripling the crawlable URL space without corresponding indexing benefit.
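A rough way to audit this pairing is to check two things per page: that its canonical points to itself, and that every alternate it lists links back to it, since hreflang annotations are expected to be reciprocal. The sketch assumes a hypothetical crawl export shaped as URL to canonical plus alternates.

```python
def hreflang_issues(pages):
    """Check self-canonicals and reciprocal hreflang links.

    pages maps URL -> {"canonical": str, "alternates": {lang: url}}
    (a hypothetical shape for crawl output).
    """
    issues = []
    for url, page in pages.items():
        if page["canonical"] != url:
            issues.append((url, "canonical is not self-referencing"))
        for lang, alt in page["alternates"].items():
            if alt == url:
                continue  # a page listing itself needs no return link
            target = pages.get(alt)
            if target is None or url not in target["alternates"].values():
                issues.append((url, f"no return link from {lang} alternate {alt}"))
    return issues

US = "https://shop.example/us/shoes"
GB = "https://shop.example/uk/shoes"
pages = {
    US: {"canonical": US, "alternates": {"en-us": US, "en-gb": GB}},
    GB: {"canonical": US, "alternates": {"en-gb": GB}},  # two deliberate errors
}
for issue in hreflang_issues(pages):
    print(issue)
```

In this sample the UK page is reported twice over: it canonicalizes to the US page, and it does not link back to the US alternate.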

Prioritizing Fixes After the Audit

Score your audit by category. Checks 1–4 address URL proliferation, typically the single largest source of crawl waste on ecommerce sites. Fix these before all others. A faceted navigation producing 50,000 parameter URLs can consume more crawl budget than every other issue on this list combined.

Checks 5–7 require infrastructure or configuration changes that affect crawl rate directly. These take longer to implement but produce measurable improvement in Crawl Stats within 30 to 60 days of deployment. Checks 8–12 are maintenance items that compound over time; schedule a quarterly review to prevent regression as catalogs grow and promotions add temporary URL sets.

Frequently asked questions

How do I find my crawl budget data in Google Search Console?

Navigate to Settings in Google Search Console, then open Crawl Stats under the Crawling section. The report shows total crawl requests, average response time, crawl breakdown by response code and file type, and trends over 90 days. This is the primary dataset for diagnosing crawl budget issues without needing third-party tools.

Does blocking pages in robots.txt save crawl budget?

Yes. URLs disallowed in robots.txt are not fetched, which directly reduces the number of crawl requests Googlebot makes. However, blocked URLs can still appear in the index if other sites link to them. For pages that must be kept out of the index entirely, use a noindex meta tag and leave the URL crawlable, because Googlebot cannot see a noindex tag on a page it is not allowed to fetch. Alternatively, use canonical tags pointing to approved URLs.

How many URLs is too many for a standard ecommerce crawl budget?

Crawl budget scales with a site's popularity and server performance, so there is no universal threshold. A site with strong link equity and fast response times earns a larger crawl budget than a newer site with slow servers. The practical rule: if Search Console shows important product or category pages in the "Discovered – currently not indexed" status, crawl budget is likely insufficient relative to URL count.

Is crawl budget a ranking factor?

Crawl budget itself is not a ranking signal. It controls whether pages get crawled and indexed, not how they rank once indexed. The indirect effect is significant: a product page that Googlebot never crawls cannot rank. For large ecommerce catalogs where new products launch frequently, ensuring those pages are crawled promptly is a prerequisite for any ranking.

Should every ecommerce store worry about crawl budget?

Stores with fewer than 1,000 URLs and fast servers rarely face crawl budget constraints. The issue becomes material at scale, typically for stores with 10,000 or more URLs, frequent catalog changes, faceted navigation, or multiple language/currency variants. If all your important pages are indexed promptly, crawl budget optimization is low priority compared to content quality and link acquisition.

Written by

Matt is the founder of RunOctopus. He built All Angles Creatures from zero to page-1 rankings in reptile feeder insects in under 60 days using exactly this method, turning a hard, entrenched niche into RunOctopus's proof store for programmatic SEO and AI search citation.
