Why robots.txt Is the Gatekeeper for AI Citations
Your robots.txt file is a plain text document at the root of your domain that tells crawlers what they can and cannot access. Every AI search engine (ChatGPT Search, Claude, Perplexity, Gemini) sends a crawler to your site before it can cite your content. If your robots.txt blocks that crawler, your store is invisible to that AI surface. No exceptions. No workarounds. The crawler respects the directive and moves on.
Many ecommerce stores unknowingly block AI crawlers. Default CDN settings often classify unfamiliar user agents as bots to be blocked. Overly broad robots.txt rules written years ago, before AI search existed, may Disallow all non-Google crawlers. WAF bot-blocking features designed to stop scrapers catch AI crawlers in the same net. This is the single most common reason stores are invisible to AI search, and it takes two minutes to fix once you know where to look.
The stakes are concrete: if your competitor's store allows AI crawlers and yours blocks them, your competitor gets cited when a shopper asks ChatGPT or Perplexity a question about your shared product category. You do not. The fix is not a strategy overhaul or a content rewrite. It is a configuration change in one file.
The Six AI Crawlers You Need to Know
Six crawlers determine whether your store appears in AI-powered search results. Each operates independently, each respects robots.txt separately, and each powers a different AI surface. Blocking one does not affect the others, but each one you block is an AI discovery channel you forfeit.
GPTBot is OpenAI's general-purpose crawler, used mainly to collect content for model training. OAI-SearchBot is OpenAI's search crawler, the one that determines whether your pages can surface in ChatGPT Search results. ClaudeBot is Anthropic's web crawler. PerplexityBot is Perplexity's crawler for its search and shopping results. Google-Extended is slightly different: it is not a separate crawler but a robots.txt token honored by Google's existing crawlers, and it controls whether your content can be used for Gemini model training and grounding. It is separate from Googlebot, which handles search indexing, including Search features such as AI Overviews. Bingbot is already allowed on most sites and powers Microsoft Copilot (formerly Bing Chat).
The critical distinction: Googlebot and Google-Extended are independent controls. Blocking Google-Extended does not change your standing in Google Search, but it withholds your content from Gemini. Blocking Googlebot removes you from Search, AI Overviews included. The same content, the same domain, two different robots.txt tokens, and a rule written for one does not apply to the other.
The Recommended robots.txt Configuration
The simplest approach is to ensure your robots.txt does not block any AI crawlers. If your file currently has a blanket User-agent: * / Disallow: with nothing after the Disallow, you are already allowing everything, including AI crawlers. No changes needed. The problem arises when robots.txt has specific Disallow rules under User-agent: * that inadvertently catch AI crawlers, or when there are explicit blocks like User-agent: GPTBot / Disallow: /.
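To make the difference between these cases concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The three robots.txt snippets and the allowed helper are illustrative assumptions, not taken from any real store, and Python's parser only approximates how each vendor's crawler evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, path="/"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical file 1: empty Disallow, so everything is allowed.
OPEN = "User-agent: *\nDisallow:\n"

# Hypothetical file 2: an old blanket rule that shuts out every crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical file 3: open by default, but GPTBot is blocked explicitly.
EXPLICIT = "User-agent: *\nDisallow:\n\nUser-agent: GPTBot\nDisallow: /\n"

print(allowed(OPEN, "GPTBot"))             # True
print(allowed(BLOCK_ALL, "ClaudeBot"))     # False
print(allowed(EXPLICIT, "GPTBot"))         # False
print(allowed(EXPLICIT, "PerplexityBot"))  # True
```

The third case is the one that tends to go unnoticed: the wildcard group looks permissive, but the named group for GPTBot takes over for that crawler.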
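The recommended configuration adds explicit Allow directives for each AI crawler. Even if your wildcard rule already permits access, explicit entries make the intent clear and protect against future changes that might inadvertently block them. Add these entries: User-agent: GPTBot / Allow: /, User-agent: OAI-SearchBot / Allow: /, User-agent: ClaudeBot / Allow: /, User-agent: PerplexityBot / Allow: /, and User-agent: Google-Extended / Allow: /. Bingbot is typically already allowed. One caveat: a crawler that matches its own named group follows only that group and ignores the rules under User-agent: *. If you want these crawlers kept out of paths such as /cart or /checkout, repeat those Disallow lines inside each named group.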
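If you would rather generate those entries than type them, a small sketch like the following prints the block to paste into robots.txt. The ai_allow_block helper and the AI_CRAWLERS list are hypothetical names introduced here for illustration; the user-agent tokens themselves are the ones listed above.

```python
# User-agent tokens named in this guide (Bingbot is omitted because it is
# typically already allowed).
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def ai_allow_block(agents=AI_CRAWLERS):
    """Build one 'User-agent / Allow: /' group per crawler, separated by blank lines."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


print(ai_allow_block())
```

The output is five two-line groups, each naming one crawler followed by Allow: /. Keeping each crawler in its own group makes it easy to tighten or loosen a single one later without touching the rest.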
Platform specifics matter. On Shopify, robots.txt is controlled through the robots.txt.liquid template in your theme code. On WooCommerce, you can edit the physical robots.txt file in your WordPress root directory or use the Yoast SEO plugin's robots.txt editor. On Wix, the robots.txt editor lives under Settings then SEO then robots.txt. After any edit, verify the result by visiting yourstore.com/robots.txt in a browser.
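Eyeballing the file in a browser works, but a short script can report the result per crawler. The sketch below is one possible approach, assuming Python's standard library only; audit and fetch_robots are hypothetical helper names, yourstore.com is a placeholder, and the parser's verdict is an approximation of what each real crawler will decide.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def audit(robots_txt, path="/"):
    """Map each AI crawler token to True (allowed) or False (blocked) for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in AI_CRAWLERS}


def fetch_robots(base_url):
    """Download robots.txt from the root of `base_url` and return it as text."""
    url = base_url.rstrip("/") + "/robots.txt"
    request = Request(url, headers={"User-Agent": "robots-audit-sketch"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage against a live site (placeholder domain, requires network access):
# for agent, ok in audit(fetch_robots("https://yourstore.com")).items():
#     print(f"{agent:16} {'allowed' if ok else 'BLOCKED'}")
```

Pass a second argument such as "/products/" to audit to check a specific section of the store rather than the homepage.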
Platform-Specific Instructions
Shopify: Navigate to Online Store, then Themes, then click Actions and Edit Code. Look for robots.txt.liquid in the Templates section. If it does not exist, create it. Shopify generates a default robots.txt that allows Googlebot but does not explicitly address AI crawlers. Add User-agent and Allow lines for GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended. Save the file and verify at yourstore.myshopify.com/robots.txt. Changes take effect immediately, with no deploy or cache purge required. For detailed Shopify SEO setup, see the full guide.
WooCommerce: You have two options. Option one: edit the robots.txt file directly in your WordPress installation's root directory via FTP or your hosting file manager. Option two: if you use Yoast SEO, go to SEO, then Tools, then File Editor, then robots.txt. Add the User-agent and Allow lines. If neither option works because WordPress is serving a virtual robots.txt rather than a physical file, check WordPress Settings, then Reading, and ensure "Discourage search engines" is unchecked, then add a physical robots.txt file. See the full WooCommerce SEO walkthrough for more detail.
Wix: Go to your site dashboard, then Settings, then SEO (Google), then robots.txt Editor. Wix provides a text editor where you can add custom directives. Add the User-agent and Allow lines for each AI crawler. Note that Wix's robots.txt editor may prepend its own default rules, so make sure your AI crawler Allow directives are not contradicted by a conflicting Disallow for the same user agent. Verify at yoursite.com/robots.txt after saving. See the full Wix SEO guide for additional configuration.
CDN and WAF Gotchas
A correct robots.txt is necessary but not sufficient. Network-level protections can block AI crawlers before they ever read your robots.txt file. Cloudflare's Bot Fight Mode and Super Bot Fight Mode are the most common culprits: they can classify unfamiliar user agents as hostile bots and serve them challenge pages or outright blocks. An AI crawler that hits a Cloudflare challenge page cannot read your content and is unlikely to get past the challenge on a later attempt. Your robots.txt says "welcome" but the door is bolted shut before they reach it.
AWS WAF bot-control managed rules may classify AI user agents as "unverified bots" and block them at the edge. Vercel's edge middleware can reject unfamiliar user agents if custom middleware logic checks for a known-bot allowlist. Sucuri, Wordfence, and other WordPress security plugins have bot-blocking features that may catch AI crawlers. Even some hosting providers apply bot-mitigation rules at the server level that you cannot see in your own configuration.
The diagnostic process: if your robots.txt allows AI crawlers but you see zero AI crawler visits in your server logs after two weeks, a network-level block is almost certainly the cause. Check your CDN's bot management settings and look for blocked requests from GPTBot, ClaudeBot, or PerplexityBot user agents. In Cloudflare specifically, navigate to Security, then Bots, and check whether Bot Fight Mode is enabled. If it is, you may need to create a WAF custom rule that explicitly allows AI crawler user agents before the bot-fight rule catches them.
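One rough way to test for a user-agent block from the outside is to request the same page twice, once with a browser-like user agent and once with a crawler-like one, and compare the status codes. The sketch below assumes Python's standard library; status_for and interpret are hypothetical helpers, and both user-agent strings are simplified stand-ins rather than the vendors' exact strings. Treat the result as a hint only: protections that verify crawlers by IP address will handle a spoofed request differently from the real crawler.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Simplified stand-in user-agent strings (assumptions, not official values).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
CRAWLER_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; probe-sketch)"


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status for `url` with the given user agent, or None if unreachable."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:   # 4xx/5xx responses still carry a status code
        return err.code
    except OSError:            # DNS failure, refused connection, timeout
        return None


def interpret(browser_status, crawler_status):
    """Turn the two status codes into a rough diagnosis."""
    if browser_status is None or crawler_status is None:
        return "unreachable"
    if browser_status < 400 <= crawler_status:
        return "likely user-agent block"
    if browser_status >= 400 and crawler_status >= 400:
        return "failing for everyone"
    return "no user-agent block detected"


# Example usage (placeholder domain, requires network access):
# url = "https://yourstore.com/"
# print(interpret(status_for(url, BROWSER_UA), status_for(url, CRAWLER_UA)))
```

A 200 for the browser string alongside a 403 or 503 for the crawler string points at the CDN or WAF rather than robots.txt.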
How to Verify AI Crawlers Are Reaching Your Pages
Allowing crawlers is step one. Verifying they are actually visiting is step two. Check your server access logs for user agent strings containing GPTBot, OAI-SearchBot, ClaudeBot, or PerplexityBot. In Cloudflare, go to Analytics, then Bot Traffic to see bot classification and request counts. In Vercel, check Runtime Logs and filter by user agent. In AWS, CloudFront access logs contain the full user agent string, so you can search for the AI crawler names.
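If you have raw access logs, counting hits per crawler takes only a few lines. The following is a minimal sketch that assumes each log line contains the user agent string somewhere in it; count_ai_hits is a hypothetical helper and the sample lines are made up for illustration.

```python
from collections import Counter

AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def count_ai_hits(log_lines):
    """Count log lines whose text mentions one of the AI crawler names."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AI_CRAWLERS:
            if agent.lower() in lowered:
                hits[agent] += 1
                break  # one crawler per request line
    return hits


# Made-up sample lines in a common access-log style.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2025:00:00:05 +0000] "GET /blog/ HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.4 - - [01/Jan/2025:00:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

print(count_ai_hits(SAMPLE_LOG))
```

To run it on a real file, pass the open file object directly, for example count_ai_hits(open("access.log", encoding="utf-8", errors="replace")), where access.log is whatever path your server writes to.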
What to look for: if you see AI crawler visits to your homepage but not to deeper content pages, your internal linking or sitemap may need work. AI crawlers discover pages by following links from pages they have already crawled. If your content pages are orphaned (no internal links pointing to them), crawlers may never find them. If you see visits to many pages, you are in good shape: the crawlers are doing their job and your content is eligible for citation.
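The homepage-only pattern is easy to spot programmatically. The sketch below assumes your logs use the common "combined" access-log format, with the request in the first quoted field and the user agent in the last; crawled_paths and homepage_only are hypothetical helpers and the sample lines are invented.

```python
import re
from collections import defaultdict

AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")

# Request path from the quoted request field, user agent from the final quoted field.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} .*"(?P<ua>[^"]*)"\s*$')


def crawled_paths(log_lines):
    """Collect the set of distinct paths each AI crawler requested."""
    paths = defaultdict(set)
    for line in log_lines:
        match = REQUEST.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        user_agent = match.group("ua").lower()
        for agent in AI_CRAWLERS:
            if agent.lower() in user_agent:
                paths[agent].add(match.group("path"))
                break
    return dict(paths)


def homepage_only(paths):
    """Return the crawlers that were seen on '/' and nowhere else."""
    return sorted(agent for agent, seen in paths.items() if seen <= {"/"})


# Invented sample lines for illustration.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [01/Jan/2025:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.8 - - [01/Jan/2025:00:00:09 +0000] "GET /products/pan HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]

print(homepage_only(crawled_paths(SAMPLE_LOG)))  # ['GPTBot']
```

Any crawler this reports is a prompt to check internal links and your sitemap before assuming the content itself is the problem.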
If you see zero AI crawler hits after seven days with a confirmed-correct robots.txt: revisit the CDN and WAF section above. If you see crawler visits but your store still is not being cited in AI search results, the issue is likely content quality, specificity, or schema markup, not access. The crawlers can reach your pages but the content is not compelling enough for the AI to cite. That is a different problem with a different solution.
The Selective Access Strategy
Some store operators want AI crawlers to access their content pages (blog posts, guides, educational content) but not their product pages. The concern is price scraping or competitive intelligence leakage. This is a valid configuration: Allow AI crawlers on /blog/, /guides/, /pages/, and content directories while Disallowing them on /products/ or /collections/. The robots.txt syntax supports this path-level granularity per user agent.
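As a sketch of what that selective setup could look like, the robots.txt text below groups four AI crawlers under one set of path rules and then checks the result with Python's standard-library parser. The paths are the ones named above; the SELECTIVE constant and the can_crawl helper are illustrative assumptions, and real crawlers may resolve edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical selective configuration: content paths open, catalog paths closed.
SELECTIVE = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /blog/
Allow: /guides/
Allow: /pages/
Disallow: /products/
Disallow: /collections/
"""


def can_crawl(agent, path, robots_txt=SELECTIVE):
    """Return True if `agent` may fetch `path` under the selective rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(can_crawl("PerplexityBot", "/blog/cookware-care"))    # True
print(can_crawl("PerplexityBot", "/products/ceramic-set"))  # False
print(can_crawl("Googlebot", "/products/ceramic-set"))      # True
```

Because the group names only the AI crawlers, Googlebot and every other crawler are untouched by these rules and keep whatever access the rest of the file gives them.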
However, blocking product pages from AI crawlers means AI search cannot cite your product information. When a shopper asks Perplexity "what is the best ceramic cookware set under $200" and your product page has a relevant answer, Perplexity cannot cite it if PerplexityBot is blocked from /products/. Your content pages may still get cited for informational queries, but commercial-intent queries, the ones closest to a purchase decision, will go to competitors whose product pages are accessible.
For most ecommerce stores, full Allow is the better strategy. The citation value of having AI surfaces reference your product pages, pricing, and specifications far exceeds any marginal scraping risk. Your prices are already visible to anyone with a browser, so AI crawlers are not revealing information that is not already public. The calculus changes only if you have genuinely proprietary content behind paywalls or if you have specific licensing objections to a particular AI company's training practices. For the majority of stores, the recommendation is clear: allow everything and let AI search optimization work in your favor.