
robots.txt for AI Crawlers: The Complete Setup Guide

By · Updated · 8 min read

Why robots.txt Is the Gatekeeper for AI Citations

Your robots.txt file is a plain text document at the root of your domain that tells crawlers what they can and cannot access. Every AI search engine (ChatGPT Search, Claude, Perplexity, Gemini) sends a crawler to your site before it can cite your content. If your robots.txt blocks that crawler, your store is invisible to that AI surface. No exceptions. No workarounds. The crawler respects the directive and moves on.

Many ecommerce stores unknowingly block AI crawlers. Default CDN settings often classify unfamiliar user agents as bots to be blocked. Overly broad robots.txt rules written years ago, before AI search existed, may Disallow all non-Google crawlers. WAF bot-blocking features designed to stop scrapers catch AI crawlers in the same net. This is the single most common reason stores are invisible to AI search, and it takes two minutes to fix once you know where to look.

The stakes are concrete: if your competitor's store allows AI crawlers and yours blocks them, your competitor gets cited when a shopper asks ChatGPT or Perplexity a question about your shared product category. You do not. The fix is not a strategy overhaul or a content rewrite. It is a configuration change in one file.

The Six AI Crawlers You Need to Know

Six robots.txt user-agent tokens determine whether your store appears in AI-powered search results. Each is matched independently in robots.txt, and each is tied to a different AI surface. Blocking one does not affect the others, but each one you block is an AI discovery channel you forfeit.

GPTBot is OpenAI's general-purpose crawler used for training data collection and content indexing. OAI-SearchBot is OpenAI's real-time search crawler: this is the one that fetches pages live when a user asks ChatGPT Search a question. ClaudeBot is Anthropic's crawler for Claude's web search feature. PerplexityBot is Perplexity's crawler for its real-time retrieval and shopping results. Google-Extended is different in kind: Google documents it as a robots.txt product token rather than a separate crawler. Fetching is still done by Google's existing crawlers, and the token controls whether the content they fetch may be used for Gemini model training and grounding. It is separate from Googlebot, which handles search indexing. Bingbot is already allowed on most sites and powers Bing Copilot and Bing Chat.

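The critical distinction: Googlebot and Google-Extended are independent. According to Google's documentation, blocking Google-Extended does not change your ranking or inclusion in Google Search; it only opts your content out of Gemini training and grounding. AI Overviews are a Search feature, so they follow Googlebot and the usual Search preview controls rather than Google-Extended. The same content, the same domain, two different tokens, and blocking one does not affect the other.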

AI Crawler Directory

CRAWLER          ORGANIZATION  POWERS
GPTBot           OpenAI        ChatGPT training + indexing
OAI-SearchBot    OpenAI        ChatGPT Search live queries
ClaudeBot        Anthropic     Claude web search
PerplexityBot    Perplexity    Perplexity search + shopping
Google-Extended  Google        Gemini training + grounding (control token, not a crawler)
Bingbot          Microsoft     Bing Copilot + Bing Chat

The six user-agent tokens that determine whether your store can be cited in AI search. Each is matched in robots.txt independently.

The Recommended robots.txt Configuration

The simplest approach is to ensure your robots.txt does not block any AI crawlers. If your file currently has a blanket User-agent: * / Disallow: with nothing after the Disallow, you are already allowing everything, including AI crawlers. No changes needed. The problem arises when robots.txt has specific Disallow rules under User-agent: * that inadvertently catch AI crawlers, or when there are explicit blocks like User-agent: GPTBot / Disallow: /.
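One way to tell which of these cases applies is to run a robots.txt body through a parser and ask it about each AI user agent. The sketch below uses Python's standard-library urllib.robotparser. The three sample files and the blocked_agents helper are illustrative assumptions, not taken from any real store, and the standard-library parser is simpler than what production crawlers run, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def blocked_agents(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI user agents that may not fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, path)]


# Hypothetical file 1: empty Disallow under the wildcard, so nothing is blocked.
OPEN_FILE = """\
User-agent: *
Disallow:
"""

# Hypothetical file 2: a legacy rule set that only lets Googlebot in.
LEGACY_FILE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical file 3: an explicit block on a single AI crawler.
EXPLICIT_FILE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

print(blocked_agents(OPEN_FILE))      # no agents blocked
print(blocked_agents(LEGACY_FILE))    # every AI agent blocked
print(blocked_agents(EXPLICIT_FILE))  # only GPTBot blocked
```

To audit a real store, paste the contents of its robots.txt into a string and pass it to blocked_agents, optionally with a product or blog path instead of the default "/".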

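The recommended configuration adds explicit Allow directives for each AI crawler. Even if your wildcard rule already permits access, explicit entries make the intent clear and protect against future changes that might inadvertently block them. Add these entries: User-agent: GPTBot / Allow: /, User-agent: OAI-SearchBot / Allow: /, User-agent: ClaudeBot / Allow: /, User-agent: PerplexityBot / Allow: /, and User-agent: Google-Extended / Allow: /. Bingbot is typically already allowed. One side effect to plan for: under the Robots Exclusion Protocol (RFC 9309), a crawler follows only the most specific group that matches its user agent, so a crawler with its own group stops inheriting the rules under User-agent: *. If your wildcard group keeps crawlers out of paths such as cart or checkout, repeat those Disallow lines inside each AI crawler group.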
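As a rough illustration of the explicit-entry approach, the following Python sketch generates one Allow group per AI user agent, appends it to an existing wildcard group, and checks the result with the standard-library urllib.robotparser. The allow_groups helper, the /cart rule, and the bot name SomeOtherBot are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def allow_groups(agents: list[str]) -> str:
    """Build one 'User-agent / Allow: /' group per agent, separated by blank lines."""
    return "\n\n".join(f"User-agent: {agent}\nAllow: /" for agent in agents) + "\n"


# Hypothetical existing wildcard group that keeps crawlers out of the cart.
EXISTING_RULES = "User-agent: *\nDisallow: /cart\n"

robots_txt = allow_groups(AI_AGENTS) + "\n" + EXISTING_RULES

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Every AI agent can reach a product page.
print(all(parser.can_fetch(agent, "/products/example") for agent in AI_AGENTS))

# A crawler with its own group no longer inherits the wildcard Disallow...
print(parser.can_fetch("GPTBot", "/cart"))
# ...while a crawler without its own group still does.
print(parser.can_fetch("SomeOtherBot", "/cart"))
```

Printing robots_txt shows the text you would paste into the file. The last two checks show that a crawler with its own group ignores the wildcard rules, which is why any wildcard Disallow lines you still want enforced need to be copied into the AI crawler groups.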

Platform specifics matter. On Shopify, robots.txt is controlled through the robots.txt.liquid template in your theme code. On WooCommerce, you can edit the physical robots.txt file in your WordPress root directory or use the Yoast SEO plugin's robots.txt editor. On Wix, the robots.txt editor lives under Settings then SEO then robots.txt. After any edit, verify the result by visiting yourstore.com/robots.txt in a browser.

Platform-Specific Instructions

Shopify: Navigate to Online Store, then Themes, then click Actions and Edit Code. Look for robots.txt.liquid in the Templates section. If it does not exist, create it. Shopify generates a default robots.txt that allows Googlebot but does not explicitly address AI crawlers. Add User-agent and Allow lines for GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended. Save the file and verify at yourstore.myshopify.com/robots.txt. Changes take effect immediately, with no deploy or cache purge required. For detailed Shopify SEO setup, see the full guide.

WooCommerce: You have two options. Option one: edit the robots.txt file directly in your WordPress installation's root directory via FTP or your hosting file manager. Option two: if you use Yoast SEO, go to SEO, then Tools, then File Editor, then robots.txt. Add the User-agent and Allow lines. If neither option works because your server uses virtual robots.txt rules, check your WordPress Settings then Reading and ensure "Discourage search engines" is unchecked, then add a physical robots.txt file. Full WooCommerce SEO walkthrough here.

Wix: Go to your site dashboard, then Settings, then SEO (Google), then robots.txt Editor. Wix provides a text editor where you can add custom directives. Add the User-agent and Allow lines for each AI crawler. Note that Wix's robots.txt editor may prepend its own default rules, so make sure your AI crawler Allow directives are not contradicted by a Disallow rule that applies to the same crawler. Verify at yoursite.com/robots.txt after saving. See the full Wix SEO guide for additional configuration.

CDN and WAF Gotchas

A correct robots.txt is necessary but not sufficient. Network-level protections can block AI crawlers before they ever read your robots.txt file. Cloudflare's Bot Fight Mode and Super Bot Fight Mode are the most common culprits: they classify unfamiliar user agents as hostile bots and serve them challenge pages or outright blocks. AI crawlers hitting a Cloudflare challenge page cannot read your content and will not retry. Your robots.txt says "welcome" but the door is bolted shut before they reach it.

AWS WAF bot-control managed rules may classify AI user agents as "unverified bots" and block them at the edge. Vercel's edge middleware can reject unfamiliar user agents if custom middleware logic checks for a known-bot allowlist. Sucuri, Wordfence, and other WordPress security plugins have bot-blocking features that may catch AI crawlers. Even some hosting providers apply bot-mitigation rules at the server level that you cannot see in your own configuration.

The diagnostic process: if your robots.txt allows AI crawlers but you see zero AI crawler visits in your server logs after two weeks, a network-level block is almost certainly the cause. Check your CDN's bot management settings and look for blocked requests from GPTBot, ClaudeBot, or PerplexityBot user agents. In Cloudflare specifically, navigate to Security, then Bots, and check whether Bot Fight Mode is enabled. If it is, you may need to create a WAF custom rule that explicitly allows AI crawler user agents before the bot-fight rule catches them.
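A quick complement to the log check is to request a page yourself while sending an AI crawler's user-agent string and see whether the edge answers with a block. The Python sketch below is a minimal version of that idea using only the standard library. The set of status codes treated as a block and the check for a cf-mitigated: challenge response header are assumptions about how a challenge typically shows up, and the user-agent string in the usage comment is a simplified placeholder rather than the crawler's published string. Bot protection often verifies source IP ranges as well, so a probe from your own machine can be treated differently from the real crawler.

```python
import urllib.error
import urllib.request

# Assumption: these statuses are the usual signs of an edge-level block.
BLOCK_STATUSES = {403, 429, 503}


def looks_blocked(status: int, headers: dict[str, str]) -> bool:
    """Heuristic: does this response look like a bot block or challenge page?"""
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    # Assumption: a challenge response carries a 'cf-mitigated: challenge' header.
    if lowered.get("cf-mitigated") == "challenge":
        return True
    return status in BLOCK_STATUSES


def probe(url: str, user_agent: str, timeout: float = 10.0) -> tuple[int, dict[str, str]]:
    """Fetch `url` with the given User-Agent and return (status, headers)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions but still carry status and headers.
        return err.code, dict(err.headers.items())


# Usage (makes a network request, so it is left commented out):
# status, headers = probe("https://yourstore.com/", "Mozilla/5.0 (compatible; GPTBot/1.0)")
# print(status, looks_blocked(status, headers))
```

If the same URL returns 200 for a normal browser user agent and a block for the crawler user agent, the network layer rather than robots.txt is the likely cause.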

How to Verify AI Crawlers Are Reaching Your Pages

Allowing crawlers is step one. Verifying they are actually visiting is step two. Check your server access logs for user agent strings containing GPTBot, OAI-SearchBot, ClaudeBot, or PerplexityBot. In Cloudflare, go to Analytics, then Bot Traffic to see bot classification and request counts. In Vercel, check Runtime Logs and filter by user agent. In AWS, CloudFront access logs contain the full user agent string, so search for the AI crawler names.
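For raw server logs, a short script can do the counting. The sketch below assumes the widely used "combined" access log format, where the user agent is the last quoted field. The sample log lines, IP addresses, and user-agent strings are invented for illustration, and the regular expression would need adjusting for other log formats.

```python
import re
from collections import Counter

AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")

# Assumes combined log format: ... "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(lines) -> Counter:
    """Count requests per AI crawler by matching names in the user-agent field."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                hits[agent] += 1
                break
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2025:10:00:02 +0000] "GET /blog/a HTTP/1.1" 200 8100 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [10/Jan/2025:10:05:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2025:10:06:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

hits = count_ai_hits(SAMPLE_LOG)
for agent in AI_AGENTS:
    print(agent, hits[agent])
```

In practice you would pass an open log file instead of SAMPLE_LOG, for example count_ai_hits(open("access.log")), where the file name is again a placeholder.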

What to look for: if you see AI crawler visits to your homepage but not to deeper content pages, your internal linking or sitemap may need work. AI crawlers discover pages by following links from pages they have already crawled, so if your content pages are orphaned (no internal links pointing to them), crawlers may never find them. If you see visits to many pages, you are in good shape: the crawlers are doing their job and your content is eligible for citation.
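The homepage-versus-deep-pages check can be made concrete by comparing the paths each crawler requested against the paths you expect to be crawled. The sketch below assumes you have already extracted (agent, path) pairs from your logs and have a list of paths from your sitemap. The function name and all sample data are hypothetical.

```python
from collections import defaultdict


def uncrawled_by_agent(hits, expected_paths):
    """Map each crawler to the expected paths it has not requested yet.

    `hits` is an iterable of (agent, path) pairs taken from access logs;
    `expected_paths` is the list of paths you want crawled (for example, from a sitemap).
    """
    seen = defaultdict(set)
    for agent, path in hits:
        seen[agent].add(path.split("?", 1)[0])  # ignore query strings
    expected = set(expected_paths)
    return {agent: sorted(expected - paths) for agent, paths in seen.items()}


# Invented data, for illustration only.
SAMPLE_HITS = [
    ("GPTBot", "/"),
    ("GPTBot", "/blog/a?utm_source=x"),
    ("ClaudeBot", "/"),
]
SITEMAP_PATHS = ["/", "/blog/a", "/products/x"]

report = uncrawled_by_agent(SAMPLE_HITS, SITEMAP_PATHS)
for agent, missing in report.items():
    print(agent, "has not fetched:", missing)
```

A crawler whose missing list contains everything except "/" is the homepage-only pattern described above, and the listed paths are the first candidates for better internal links.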

If you see zero AI crawler hits after two weeks with a confirmed-correct robots.txt, revisit the CDN and WAF section above. If you see crawler visits but your store still is not being cited in AI search results, the issue is likely content quality, specificity, or schema markup, not access. The crawlers can reach your pages but the content is not compelling enough for the AI to cite. That is a different problem with a different solution.

The Selective Access Strategy

Some store operators want AI crawlers to access their content pages (blog posts, guides, educational content) but not their product pages. The concern is price scraping or competitive intelligence leakage. This is a valid configuration: Allow AI crawlers on /blog/, /guides/, /pages/, and content directories while Disallowing them on /products/ or /collections/. The robots.txt syntax supports this path-level granularity per user agent.
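A minimal sketch of that selective configuration, again checked with Python's standard-library urllib.robotparser, might look like the following. The selective_group helper and the example URLs are assumptions for illustration. The path lists mirror the directories named above and should be replaced with your store's real structure.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
CONTENT_PATHS = ["/blog/", "/guides/", "/pages/"]
PRIVATE_PATHS = ["/products/", "/collections/"]


def selective_group(agent: str) -> str:
    """Build one robots.txt group: Allow content paths, Disallow catalog paths."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Allow: {path}" for path in CONTENT_PATHS]
    lines += [f"Disallow: {path}" for path in PRIVATE_PATHS]
    return "\n".join(lines)


robots_txt = "\n\n".join(selective_group(agent) for agent in AI_AGENTS) + "\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs on the store.
for path in ("/blog/care-guide", "/products/ceramic-set", "/collections/cookware"):
    print(path, parser.can_fetch("PerplexityBot", path))

# Crawlers without their own group (such as Googlebot here) are unaffected.
print(parser.can_fetch("Googlebot", "/products/ceramic-set"))
```

The Allow lines are not strictly required, since anything not disallowed is allowed by default, but they keep the intent readable for whoever edits the file next.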

However, blocking product pages from AI crawlers means AI search cannot cite your product information. When a shopper asks Perplexity "what is the best ceramic cookware set under $200" and your product page has a relevant answer, Perplexity cannot cite it if PerplexityBot is blocked from /products/. Your content pages may still get cited for informational queries, but commercial-intent queries, the ones closest to a purchase decision, will go to competitors whose product pages are accessible.

For most ecommerce stores, full Allow is the better strategy. The citation value of having AI surfaces reference your product pages, pricing, and specifications far exceeds any marginal scraping risk. Your prices are already visible to anyone with a browser, so AI crawlers are not revealing information that is not already public. The calculus changes only if you have genuinely proprietary content behind paywalls or if you have specific licensing objections to a particular AI company's training practices. For the majority of stores, the recommendation is clear: allow everything and let AI search optimization work in your favor.

Frequently asked questions

Will allowing AI crawlers slow down my site?

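No measurable impact for most stores. AI crawlers typically make far fewer requests than Googlebot. A standard ecommerce store receives 10 to 50 AI crawler requests per day, negligible compared to the thousands from Googlebot and other bots. Crawl-delay is a non-standard directive that some crawlers honor and others ignore, so if crawl volume ever becomes a concern, rate limiting at your CDN or server is the more dependable control.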

Can I allow some AI crawlers but block others?

Yes. Each User-agent entry in robots.txt is independent. You can Allow GPTBot and ClaudeBot while Disallowing others. However, each blocked crawler removes you from that AI surface's citation pool. There is rarely a good reason to selectively block unless you have specific licensing concerns with a particular AI company.

Does Shopify let me edit robots.txt?

Yes. Edit the robots.txt.liquid file in your theme code: Online Store, then Themes, then Edit code, then search for robots.txt.liquid in the Templates section. Add User-agent and Allow directives for each AI crawler you want to permit. Changes take effect immediately.

What if I block AI crawlers and change my mind later?

Unblock them and AI surfaces will begin crawling within days. There is no penalty for previously blocking โ€” once access is restored, your pages become eligible for citation immediately. The only cost is the time you were invisible.

Should I add AI crawlers to my sitemap submission?

AI crawlers discover pages through their own crawling. Unlike Google, which accepts sitemap submissions through Search Console, they offer no submission step. However, a well-structured sitemap with accurate lastmod dates helps all crawlers, including AI ones, find and prioritize your freshest content. Keep your sitemap current regardless.
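For stores that generate their own sitemap, keeping lastmod accurate mostly means writing the real modification date of each page rather than the build date. The Python sketch below shows one way to emit a small sitemap with per-URL lastmod values using the standard library. The build_sitemap helper, the example.com URLs, and the dates are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> str:
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod uses the W3C date format, e.g. 2025-01-15
        ET.SubElement(url, "lastmod").text = modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages and dates, for illustration only.
PAGES = [
    ("https://example.com/blog/care-guide", date(2025, 1, 15)),
    ("https://example.com/products/ceramic-set", date(2025, 2, 1)),
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Hosted platforms such as Shopify and Wix generate the sitemap for you, so this applies mainly to custom or self-hosted storefronts.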

Written by

Matt is the founder of RunOctopus. He built All Angles Creatures from zero to page-1 rankings in reptile feeder insects in under 60 days using exactly this method, turning a hard, entrenched niche into RunOctopus's proof store for programmatic SEO and AI search citation.
