Technical SEO for AI: Robots.txt, GPTBot & llms.txt Explained | 3.4. AEO Course by Ahrefs
Six practical checks (robots.txt, JS rendering, speed, structure, schema, and 404 optimization) to boost AI visibility and reduce hallucinated URLs, explained by Ahrefs’ Sam Oh.
Summary
Sam Oh walks viewers through six actionable checks to improve AI access to your content, without requiring code rewrites. He emphasizes that many sites unknowingly block AI crawlers, citing Ahrefs data showing 5.9% of 140 million sites blocking GPTBot. The lesson covers robots.txt nuances, including blocks set by Cloudflare’s default AI traffic feature, and how to verify your site isn’t unintentionally hiding from GPT-based crawlers. He also introduces llms.txt, noting that major providers haven’t adopted it yet, so it’s not something to rely on today.

JavaScript rendering matters because some AI crawlers can render JS while ChatGPT’s crawler cannot, so server-side rendering can be a crucial fix. Page speed remains important for AI retrieval, as real-time fetching and parsing can drop slow pages before scoring. A clean HTML structure, with a logical heading hierarchy and atomic content, helps AI parse and extract the right information. Schema markup is discussed, with mixed evidence on direct AI benefits but no harm if you already use it. Finally, he warns about AI-hallucinated URLs and shows how to redirect or fix 404s for pages that AI references, potentially reclaiming lost traffic.

The episode ends by hinting at upcoming coverage on measuring and tracking AI visibility. Overall, the guidance ties directly to improving content parsability for AI systems and aligns with SEO fundamentals already familiar to Ahrefs users.
Key Takeaways
- Check robots.txt for blocks against GPTBot, ClaudeBot, Google-Extended, and OAI-SearchBot; remove or adjust Disallow lines to ensure AI crawlers can access content.
- Understand that llms.txt is not yet adopted by major providers, so prioritize robots.txt and on-page fundamentals for now.
- If content relies on JavaScript, implement server-side rendering or test rendering by disabling JS to confirm AI visibility.
- Speed matters for AI retrieval; optimize Core Web Vitals and maintain fast, clean HTML to benefit AI as well as Google.
- Structure your HTML with a clear heading hierarchy (H1, H2, H3) and atomic content to aid AI extraction and parsing.
- Schema markup can help, but there’s no conclusive proof it boosts AI citations; keep it if you already use it for SEO.
- Guard against AI hallucinations by monitoring analytics for AI referrer traffic hitting 404s, and set up redirects to relevant real pages.
Who Is This For?
This is essential viewing for SEO professionals and developers who want their content found by AI crawlers (GPT-based copilots, ChatGPT, etc.). It’s particularly valuable if you rely on AI-assisted content discovery or want to ensure high AI visibility without reworking site code.
Notable Quotes
"This lesson isn't about rewriting your site's code."
—Sam Oh clarifies that the focus is on accessibility for AI crawlers, not code changes.
"5.9% of 140 million websites are blocking GPTBot, OpenAI's crawler."
—Illustrates the size of the issue and why this topic matters.
"Cloudflare has a feature called instruct AI bot traffic with robots.txt. That's now enabled by default."
—Highlights a concrete default setting that could block AI crawlers unintentionally.
"LLM.txt is a proposed standard, but no major LLM provider officially supports it yet."
—Sets realistic expectations about the current relevance of LLM.txt.
"ChatGPT literally can't see your content."
—Emphasizes the importance of server-side rendering for AI visibility.
Questions This Video Answers
- How do I configure robots.txt to allow GPTBot and other AI crawlers?
- What is llms.txt and should I bother creating one right now?
- What steps fix AI hallucinated URLs and reduce 404 errors for AI referrals?
- Does schema markup help AI visibility, and which types are worth implementing?
- How can I test if my site is rendering correctly for AI crawlers without breaking existing SEO?
Robots.txt · GPTBot · llms.txt · JavaScript rendering · Server-side rendering · Core Web Vitals · Schema.org · AI hallucinated URLs · AI visibility · Ahrefs AEO
Full Transcript
Hey, it's Sam Oh and welcome to the fourth lesson in this module, which is on the technical side of AEO. Now, I know technical can sound intimidating, but this lesson isn't about rewriting your site's code. It's about making sure that AI can actually access and understand your content. And while access and understand might sound rudimentary for some of you, the reality is a lot of sites are accidentally blocking AI without even knowing it. According to our data, around 5.9% of 140 million websites are blocking GPTBot, which is OpenAI's crawler. That's millions of sites that are invisible to ChatGPT.
So, in this lesson, I've got six technical checks and tips for you to make sure AI can find you so that it can promote you. Let's get started. The first thing you need to check is your robots.txt file. Robots.txt is a file on your site that tells crawlers what they can and can't access. And the thing is, it's not just Google's crawler you need to think about anymore. There are now dozens of AI-specific bots that crawl the web. The main ones you should know about are GPTBot and OAI-SearchBot from OpenAI, ClaudeBot from Anthropic, and Google-Extended from Google.
If any of these are blocked in your robots.txt, you're asking those AI platforms not to crawl your content. And assuming they obey your rules, they won't be recommending your pages, since they won't know what's on them. Now, you might not have blocked these bots intentionally, but a lot of sites inherit robots.txt rules from templates or old configurations, and some platforms add blocks by default. For example, Cloudflare has a feature called instruct AI bot traffic with robots.txt. That's now enabled by default. When this is on, Cloudflare automatically updates your robots.txt to signal that your content shouldn't be used for AI training.
So, if your site is on Cloudflare, you could be blocking AI crawlers without even realizing it. The first step is simple: go to yourdomain.com/robots.txt and look for any lines that mention GPTBot, ClaudeBot, Google-Extended, or OAI-SearchBot. If you see a Disallow rule next to any of those, you're blocking that AI crawler. You can also use Ahrefs Site Audit to check this. Run a crawl on your site and it'll flag any robots.txt rules that might be blocking AI crawlers.
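For context, here's a minimal sketch of what these rules look like in a robots.txt file. The rules below are illustrative only, not a recommendation for your site; the first block is the kind of Disallow you'd want to remove if you want OpenAI crawling you, and the second shows an explicit allow:

```
# Asks OpenAI's training crawler to stay off the entire site
User-agent: GPTBot
Disallow: /

# Explicitly allows Anthropic's crawler everywhere
User-agent: ClaudeBot
Allow: /
```

If a crawler's user-agent isn't named at all, it falls back to whatever rules you've set under `User-agent: *`.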
Now, while we're on the topic of files AI reads, I want to make a quick note on something you might have heard of called llms.txt. This is a proposed standard, kind of like robots.txt, but specifically designed to tell AI systems about your site. The idea is that you create a file at yourdomain.com/llms.txt that gives AI a summary of who you are, what your site covers, and where to find your most important content. It's useful in theory, but as of right now, no major LLM provider officially supports it. OpenAI doesn't use it. Anthropic publishes one on their own site but hasn't confirmed their crawlers actually read it, and Google hasn't adopted it either. So, should you create one? Well, I don't think it'll hurt you, but I wouldn't prioritize it over the other things we've talked about in this lesson.
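If you do decide to experiment with one anyway, the proposal (llmstxt.org) uses plain Markdown: an H1 with your site name, a short blockquote summary, then sections of annotated links. A minimal sketch, with hypothetical pages:

```markdown
# Example Co

> Example Co makes invoicing software for freelancers. This site covers
> product docs, pricing, and a blog on small-business finance.

## Docs

- [Getting started](https://example.com/docs/start): Set up in ten minutes
- [API reference](https://example.com/docs/api): Endpoints and authentication

## Blog

- [Quarterly taxes](https://example.com/blog/quarterly-taxes): The basics, explained
```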
Robots.txt is still the file that actually matters most right now. All right, the second thing to check is how your site handles JavaScript. Some AI platforms can render JavaScript and some can't. Without getting too technical, Gemini and Copilot can render JS, while ChatGPT's crawler does not. So if your content relies on JavaScript to load, which is common with single-page apps and some React or Angular frameworks, ChatGPT literally can't see your content. It visits the page and gets an empty shell. The fix here is server-side rendering, which means your server sends the fully rendered HTML to the crawler instead of relying on JavaScript to build the page in the browser.
If you're already doing this for SEO, you're covered. If not, it's worth looking into, especially if AI visibility matters to you. A quick way to test this is to disable JavaScript in your browser and visit your own site. If the content disappears, you have a JavaScript rendering issue that's affecting AI crawlers, too.
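If you'd rather script that check, you can fetch the raw HTML the way a non-JS-rendering crawler would and see whether your key content shows up in it. A minimal TypeScript sketch (Node 18+; the URL and marker string are hypothetical placeholders):

```ts
// Fetch the page over plain HTTP. No JavaScript is executed, so this is
// roughly what a crawler that can't render JS receives.
const url = "https://example.com/blog/some-post"; // your page (hypothetical)
const marker = "your exact headline text";        // text that should be present

const res = await fetch(url);
const html = await res.text();

if (html.toLowerCase().includes(marker.toLowerCase())) {
  console.log("Content found in raw HTML - visible without JS rendering.");
} else {
  console.log("Content missing from raw HTML - likely a JS rendering issue.");
}
```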
The third thing to consider is page speed. Now, you might be thinking page speed is an SEO thing, not an AEO thing, but it can actually matter more for AI retrieval than for traditional search. When AI systems retrieve information in real time, they're fetching, parsing, and chunking your pages on the fly. And if your page takes too long to load, it can get dropped before it's even scored. So it won't be making it into an AI response, even if the content is great. The good news is that if you've already optimized your Core Web Vitals for SEO, you're most of the way there. Fast-loading pages with clean HTML benefit both Google and AI systems. And that brings us to the fourth tip: create a clean HTML structure. This one's straightforward. AI systems parse your content by following your HTML structure. So if your headings are logical, your sections are well organized, and your paragraphs are focused on one idea each,
AI has an easier time extracting the right information. This ties directly back to the content principles we covered in lesson 3.1: BLUF (bottom line up front), atomic content, and entity-rich writing. Those principles aren't just about writing style. They're about making your content technically parsable for AI. So, when you're structuring your pages, use a proper heading hierarchy: H1 for the title, H2s for the main sections, and H3s for subsections. And make sure each section can stand on its own, because AI might chunk your content at any heading boundary.
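To make that concrete, here's a bare-bones skeleton of the kind of structure he's describing (the topic and headings are made up):

```html
<h1>How to brew pour-over coffee</h1>

<h2>Equipment you need</h2>
<p>One focused idea per paragraph: here, just the gear list.</p>

<h2>Brewing step by step</h2>

<h3>Grind size and ratio</h3>
<p>A self-contained chunk that still makes sense if AI extracts it alone.</p>

<h3>Pouring technique</h3>
<p>Another atomic section with its own complete answer.</p>
```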
The fifth tip is about schema markup. Schema markup, which is also called structured data, is code you add to your pages to help search engines understand your content. Things like Article, FAQPage, HowTo, and LocalBusiness schema. Now, does it help with AEO? Honestly, the evidence is mixed. There's no confirmed data that adding schema directly improves your chances of being cited by AI, but it doesn't hurt. And if you're already using it for SEO, there's no reason to remove it. I wouldn't spend a ton of time on schema specifically for AEO, but if you're setting up a new page, adding the right schema types is a good habit that makes your content easier for any system to understand.
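For reference, structured data is typically added as a JSON-LD script tag in the page's head. A minimal Article example using the schema.org vocabulary, with hypothetical values:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to brew pour-over coffee",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2025-01-15"
}
</script>
```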
All right, the sixth tip that I have for you is to optimize for AI-hallucinated URLs. AI assistants sometimes make up URLs that don't exist on your site. They'll recommend a page to a user, the user clicks it, and they hit a 404 error. And this happens a lot more often than you'd expect. According to our data, AI assistants send visitors to 404 pages 2.87 times more often than Google search does. And ChatGPT is the biggest offender, with about 1% of its clicked URLs leading to 404 pages. Now, rather than letting that 404 be the end of a visitor's browsing journey, you should either fix or optimize those pages to get more out of them.
You can do that by checking your analytics for pages that are getting traffic from AI referrers but returning a 404 status. If you spot a hallucinated URL that's getting consistent traffic, set up a redirect to the most relevant real page on your site. That way, you're capturing traffic that would otherwise be lost.
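How you set up the redirect depends on your stack. As one example, a Next.js site could map a hallucinated path to the real page in its config file; both paths here are hypothetical stand-ins for what you'd find in your own analytics:

```ts
// next.config.ts - permanently redirect a URL that AI assistants keep
// inventing to the closest real page on the site.
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  async redirects() {
    return [
      {
        source: "/blog/ai-seo-guide-2024", // hallucinated URL from analytics
        destination: "/blog/ai-seo-guide", // the real page it should reach
        permanent: true,                   // sends a permanent redirect status
      },
    ];
  },
};

export default nextConfig;
```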
Now, while creating content and getting cited is a big part of AEO, it's only part of the picture. You also need to know if it's actually working. And that's exactly what we'll be covering in module four, which is all about measuring and tracking your AI visibility. I'll see you there.