How to Get Cited by ChatGPT (1.4 Million Prompt Study Reveals the Truth)

Edward Sturm | 00:15:40 | May 3, 2026

Chapters (11)

This chapter examines how ChatGPT cites sources after retrieving many URLs for a single query, revealing that only about half of the retrieved sources end up being cited and exploring why some pages are credited while others are not.

If you want AI to cite your pages, align your titles, content, and URLs with the AI’s fanout queries and target the right retrieval channels the model favors.

Summary

Edward Sturm dives into a provocative 1.4 million-prompt study from Ahrefs to reveal how ChatGPT selects and cites sources. He explains that ChatGPT relies on a gatekeeping layer where the page title, snippet, URL, and an internal ID determine whether a source is opened and cited. Sturm highlights a striking mismatch: Reddit content is heavily used for context, yet cited only about 1.93% of the time, while YouTube and academic sources are retrieved at scale but rarely cited. The analysis emphasizes semantic similarity between a page’s retrieval data and the user’s query as a major driver of citations, with titles and URLs that semantically align with the AI’s internal fanout queries performing best. He also covers the impact of the ref_type categories (search, news, Reddit, YouTube, academia) and notes that fresh content helps in news, but relevance still trumps freshness in most contexts. Sturm then translates the findings into practical SEO tactics, such as optimizing page titles, H1s, and natural-language URLs to match fanout queries, and using tools like Ahrefs Brand Radar and its AI responses report to diagnose gaps. He concludes that “SEO product pages and SEO landing pages” are cited particularly well in AI, and plugs his Compact Keywords course for creators who want to build pages that attract AI-driven traffic. The episode closes with social proof from a testimonial and an invitation to learn more about Edward Sturm’s SEO approach.

Key Takeaways

ChatGPT’s citation decisions hinge on a gatekeeping step where the page title, snippet, URL, and an internal ID influence whether a source is opened and cited.
Only about 50% of retrieved URLs are actually cited, despite ChatGPT pulling dozens of URLs per query.
Cited pages show higher semantic similarity between their title and the user’s prompt, especially when considering fanout queries generated by the AI.
Natural-language URL slugs correlate with an 89.78% citation rate versus 81.11% for non-natural language slugs, indicating title/URL relevance matters.
Reddit is widely used as a retrieval source for context but is cited far less often (1.93%), suggesting the AI uses Reddit for understanding while not giving it credit.
Freshness matters more for news, where newer pages have a citation edge, but for most content, relevance to fanout queries outweighs age.
SEO-friendly product and landing pages designed around high-intent keywords and aligned with fanout queries are among the most citable content in AI.
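
The natural-language slug takeaway above can be checked on your own URLs with a rough heuristic. This is only a sketch: the study does not say how it classified slugs, so the rule used here (at least two alphabetic words in the last path segment) and the names slug_words and looks_natural are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def slug_words(url: str) -> list[str]:
    """Return the alphabetic words found in the last path segment of a URL."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    # Drop a trailing file extension such as ".html" before splitting.
    slug = re.sub(r"\.[a-z0-9]{2,5}$", "", slug, flags=re.IGNORECASE)
    tokens = re.split(r"[-_+]", slug)
    # Keep only purely alphabetic tokens longer than one character.
    return [t.lower() for t in tokens if t.isalpha() and len(t) > 1]


def looks_natural(url: str, min_words: int = 2) -> bool:
    """Assumed rule of thumb: a slug is 'natural language' if it has >= min_words words."""
    return len(slug_words(url)) >= min_words


if __name__ == "__main__":
    for candidate in (
        "https://example.com/blog/how-to-get-cited-by-chatgpt",
        "https://example.com/node/8f3a2c91",
        "https://example.com/p?id=48213",
    ):
        print(candidate, looks_natural(candidate))
```

A slug that fails this check is not necessarily a problem; the heuristic is only meant to flag opaque IDs and query-string URLs worth a second look.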

Who Is This For?

Essential viewing for SEO professionals and content engineers who want to understand how AI cites sources and how to structure pages to improve citation odds. It’s also valuable for marketers looking to optimize content for AI-assisted search and citation behavior.

Notable Quotes

“We discovered five categories, search, news, Reddit, YouTube, and academia. The citation rates between them are wildly uneven.”

—Introduces the ref_type categories and sets up the uneven citation landscape.

“Reddit has its own dedicated ref type in ChatGPT's retrieval system with over 16 million data points... yet it is cited at a rate of just 1.93%.”

—Highlights the paradox of heavy usage vs. low citation rate for Reddit.

“Ultimately, the pages that get cited are the ones whose titles and content match the questions ChatGPT is asking behind the scenes.”

—Summarizes the practical takeaway on alignment with fanout queries.

“If you want to get cited in AI, you can find the searches that they are using.”

—Direct call to action for tailoring content to AI search behavior.

Questions This Video Answers

How does ChatGPT decide which sources to cite from a search result set?
What makes a page more likely to be cited by AI—title relevance or freshness?
Why does Reddit get used for context but rarely cited in ChatGPT responses?
How can I optimize my SEO pages to improve their likelihood of AI citation?
What are fanout queries and how do they influence AI source selection?

ChatGPT citations, Ahrefs study, fanout queries, semantic similarity, ref_type categories, SEO for AI, natural-language URLs, Reddit vs YouTube vs News citations, AI-powered content strategy, Compact Keywords course

Full Transcript

Why does ChatGPT cite one page over another? This is a study of 1.4 million prompts. This study is from Ahrefs. We've all got used to the little numbered blue links in ChatGPT's responses. They're the citations that back up ChatGPT's responses with external information. But although ChatGPT retrieves dozens of URLs to answer a single query, according to our research, it only ends up citing around 50% of them. Why does one page get credit while another, which the AI clearly retrieved, gets nothing? According to studies by AI expert Dan Petrovic, when ChatGPT retrieves results, each one comes back with a page title, a brief snippet or summary, the URL, and an ID number. ChatGPT uses this data to decide which pages are worth opening and eventually citing in its response. In other words, there's a gatekeeping layer before ChatGPT opens and reads any of your actual page content. The title, snippet, and URL are doing the heavy lifting in that initial decision.
Before I go on, I'm picking out my favorite parts from this article. This whole article is fire, and I'm going to share the parts that I thought to be most impactful. Ahrefs is cooking with the SEO content. They are always cooking with the SEO content. I love sharing their stuff. And also, shout out to Patrick Stox from Ahrefs, who was on the podcast last week. That was episode 1,022 of the show, "Patrick Stox Exposes the Biggest SEO Myths in 2026." That was a great episode, too. And this is going to be another great one.
So, the article continues, "We wanted to know what actually influences citations. Does higher semantic similarity between a page's retrieval data and the user query increase citation likelihood? Which fields matter most? Do human-readable URLs outperform opaque ones? To find out, we analyzed 1.4 million ChatGPT 5.2 prompts from February 2025 on desktop with the help of Ahrefs data scientist Xibeijia Guan."
But before we get into the findings, you need to understand how ChatGPT actually gathers its sources, because not all URLs enter the system the same way. When ChatGPT retrieves results, it categorizes sources using an internal field called ref_type, essentially a label for the retrieval channel the URL came through. And this next part is really fascinating. We discovered five categories: search, news, Reddit, YouTube, and academia. The citation rates between them are wildly uneven. The general search index dominates both in volume and citation rate. And 88% of the URLs that end up being cited by ChatGPT are taken directly from search. If you want to be cited by ChatGPT, you need to be in that search selection pool, which means your content needs to rank. There we go. This isn't new information. By now, most people are already aware that ranking plays a part, but it's nice to have some more data to back it up.
Specialized verticals like YouTube and academia, on the other hand, are pulled in at scale but barely ever get surfaced as actual citations. Which, actually, from my research, when I'm doing this, I've seen the opposite. I see YouTube getting cited all the time, and Adweek put up an article saying that YouTube overtook Reddit as the most cited. This research from Ahrefs says different. However, you know, let's accept that we don't know what is more cited, Reddit or YouTube, but we do know that using both for SEO is very valuable and works really well. That's what we do know. And with this Ahrefs research, Reddit gets 1.93% of the citations and YouTube gets 0.51% of the citations. Academia gets 0.4% of the citations. News gets 12.01% of the citations.
Article continues, "The Reddit and YouTube ref types likely represent additional results, those pulled in via dedicated API integrations on top of whatever the web search already returned. That's why the volume on those channels is so high."
ChatGPT is supplementing its search results with a separate feed of Reddit and YouTube content. This matters a lot for interpreting the rest of the analysis. On average, ChatGPT pulls around 16.57 cited URLs and around 16.58 non-cited URLs per prompt. Wow. But because Reddit makes up 67.8% of the non-cited pool, any aggregate comparison of cited versus non-cited is really comparing search results to Reddit API output, not apples to apples. So throughout this research, we've isolated the analysis by ref type wherever possible to avoid that distortion.
And it's going back to this figure: 67.8% of non-cited URLs are from Reddit. And the article says this is probably the most striking finding in the data set. Reddit has its own dedicated ref type in ChatGPT's retrieval system, with over 16 million data points in our data set. Yet it is cited at a rate of just 1.93%. Meanwhile, 67.8% of all non-cited URLs come from Reddit. In other words, ChatGPT is using Reddit extensively to understand topics, gauge consensus, and build context, but it almost never gives Reddit the credit. Ah, that's so interesting. It learns from the crowd, then cites another institution.
All right, here we go. Titles need to be semantically relevant to fanout queries. To figure out what's citable, ChatGPT estimates relevance in a process sometimes described as semantic scoring, to judge whether an article and a query are related. Since ChatGPT is a closed-source model, we don't have visibility into exactly how it determines relevance internally. So in this study, we use cosine similarity computed from embeddings generated by open-source models to quantify and approximate how ChatGPT may work. ChatGPT matches URLs against its own fanout queries, the sub-questions it generates internally from a user's seed prompt to hunt for specific facts. So when you ask ChatGPT a question, a lot of the time it goes off and does a web search with different queries. These are called fanout queries.
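
As a rough illustration of the cosine similarity idea described above, here is a minimal sketch that scores a page title against a list of fanout queries. It is not the study's method: the study used embeddings from open-source models, while this sketch swaps in plain word-count vectors so that it runs with the standard library alone. The function names and the sample title and queries are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_fanout_match(title: str, fanout_queries: list[str]) -> tuple[str, float]:
    """Return the query most similar to the title and its score (list must be non-empty)."""
    title_vec = bag_of_words(title)
    scored = [(q, cosine_similarity(title_vec, bag_of_words(q))) for q in fanout_queries]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical title and fanout queries, for illustration only.
    query, score = best_fanout_match(
        "Winter Fashion Trends: What to Wear This Season",
        ["winter fashion trends", "best running shoes", "what to wear this winter"],
    )
    print(query, round(score, 3))
```

Word counts only reward exact word overlap, so a real embedding model would be needed to capture paraphrases; the scoring step itself stays the same.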
The data confirms that title relevance to fanout queries is an important factor in citation. And this is no different from SEO. You want your page title, your H1, your slug to target the keyword. Very, very similar. Across all ref types, cited URLs have consistently higher similarity between their title and the original prompt. The gap widens further when we compare against fanout queries instead of the original prompt, reinforcing that creating content relevant to ChatGPT's internal sub-questions is what really drives selection. When we isolate the search ref type specifically, the pattern gets even sharper. Cited pages are clearly more relevant, and the non-cited distribution drops significantly. We also found that search results with natural-language URL slugs had an 89.78% citation rate compared to 81.11% for those without. Ultimately, here we go. Ultimately, if your URL and title don't semantically align with the AI's internal fanout queries, you're less likely to get cited.
If you've been listening to this podcast for a long time, you have already heard this from me before, and from a lot of my guests before. If you want to get cited in AI, you can find the searches that they are using. And if you want to know how to see ChatGPT's searches, here's how you do it.
Brands get shown more on ChatGPT. Uh, yeah. Go to ChatGPT. Put in your prompt: "fashion trends for this winter." Right-click, Inspect. Go to Network. In the URL, copy what comes after /c/. Paste it in the filter. Now refresh the page. Click the orange brackets with the code you just put in. Go to Response. Search this code for the word "queries." Now you can see what ChatGPT searched the web with. Use the language that ChatGPT is searching with. Put it in your content as H2 sections, or create entirely new pieces of content for exactly what ChatGPT is searching. That is how you get shown more in ChatGPT.
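
The manual search for the word "queries" in the Response tab can also be done with a short script once the response body has been copied out of the browser. This is a sketch under assumptions: the structure of that response is not documented in this episode, so the code simply walks whatever JSON it is given and collects string lists stored under any key named "queries". The sample payload and the function names are invented for illustration and do not reflect the real response format.

```python
import json
from typing import Any


def collect_queries(node: Any, key: str = "queries") -> list[str]:
    """Recursively gather strings from every list stored under `key` in parsed JSON."""
    found: list[str] = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, list):
                found.extend(item for item in v if isinstance(item, str))
            else:
                found.extend(collect_queries(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_queries(item, key))
    return found


def unique_in_order(items: list[str]) -> list[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Hypothetical payload shape, made up for this example.
sample = json.loads("""
{
  "messages": [
    {"metadata": {"queries": ["winter fashion trends 2026", "winter coat styles"]}},
    {"metadata": {"queries": ["winter fashion trends 2026"]}}
  ]
}
""")

if __name__ == "__main__":
    for q in unique_in_order(collect_queries(sample)):
        print(q)
```

To try it on a real capture, paste the copied response into a file and load it with json.load, assuming the body is valid JSON; if it is not, the plain text search described in the episode still works.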
And to learn my exact method of doing SEO that gets paying customers, go to compactkeywords.com. And shout out to Ahrefs too. Ahrefs has a section, "Optimize for fanout queries using Brand Radar." You can study fanout queries directly inside Ahrefs Brand Radar. Head to the AI responses report. Pick any prompt and you'll see the fanout queries ChatGPT generated alongside the cited URLs. From there, use the AI Content Helper to check how well your page covers the topics those fanout queries address. It measures the cosine similarity between your content and the topics the SERP or AI response is trying to cover, and gives you a colored highlight as you write, showing which gaps remain. If a competitor's page is getting cited for a query where yours isn't, this is one of the fastest ways to diagnose why.
Oh, this next part of the article had a lot of SEOs kind of going crazy. The average cited page is 500 days old and still getting picked. It's common knowledge that fresher content gets cited more by AI. And in fact, our own study of 17 million citations supports that. We found that ChatGPT cited URLs that were 458 days newer than Google's organic results, the strongest freshness preference of any platform we tested. This study doesn't contradict that narrative, but it does add an extra layer of nuance. For instance, when we looked at the search index, cited pages span a wide range of ages. The median is around 500 days, 1.3 years old, with some cited pages over 2,700 days old, around 7.4 years old. The median age is actually far lower than in our initial freshness study linked above, suggesting that ChatGPT is skewing even younger in its citation preferences. That said, we also found that non-cited pages are overwhelmingly very young. So within a single prompt's retrieval set, it's the older, more established pages that tend to get cited and the freshest content that tends to get discarded. Wow.
In other words, ChatGPT prefers fresh content, but tends to cite comparatively older content more often. That sounds counterintuitive, but both things can be true at the same time. Across the broader population of AI citations, ChatGPT does skew fresher when compared against Google results, and even against its own citation preferences from only last year. But within a given retrieval set, freshness alone isn't enough. Relevance still does the heavy lifting. A new page that matches fanout queries well will get cited. A new page that doesn't will be retrieved yet ignored. Where freshness matters most is in news. In this category, title relevance scores for cited and non-cited pages are nearly identical. The AI can't decide based on relevance alone, so it defaults to a temporal tiebreaker: page age. Cited news pages skew younger. That makes sense. For news queries, younger pages have a clear advantage, even when relevance scores between cited and non-cited pages are similar.
And the conclusion to this article, this is what it all means for being, quote unquote, citable. The 1.4 million prompts paint a pretty clear picture. ChatGPT is an aggressive editor. It favors its general search index, uses semantic similarity to select and cite sources, and treats Reddit as a textbook it's embarrassed to admit it read. Ultimately, the pages that get cited are the ones whose titles and content match the questions ChatGPT is asking behind the scenes, and that surface through the right retrieval channel. Thank you again to Ahrefs for this article.
You know what's cited really well in AI? SEO product pages and SEO landing pages. According to research from PromptWatch, these are two of the most cited types of content in AI. And going by Ahrefs, these are ones that would appear in the search ref type, which is cited 88% of the time. SEO product pages and SEO landing pages.
And when you do these properly, you are targeting keywords, common keywords, high-intent keywords. And so your relevance is very high, just like Ahrefs says you want to have. So literally your pages selling your product, your services, whatever it is that you offer, are getting cited so well in AI according to research, this research from Ahrefs and the research that I mentioned from PromptWatch. And so if you want to learn how to make these SEO landing pages and SEO product pages, I have an entire course at compactkeywords.com about doing this. Compact Keywords shows you how to find very high-intent searches that people are doing, how to target them, how to make these pages, what they should look like with very specific templates, how to build links, how to think about press, and so much more. I'm constantly getting the most incredible video testimonials from people. And I got this one last week from William Moon. I really love this one.
I just finished the Compact Keywords training program and it absolutely blew my brain how good it was. Uh, my name is William Moon and I run a financial agency, and I also have a side marketing agency that I run, and my wife and I, we have six kiddos, and I've been an entrepreneur providing for my family for well over a decade. And so we have depended upon good marketing strategies for the livelihood of my family and our six kids. And you know those kids, they ain't cheap, you know, they cost some money. And so when I came in, I did come in with the SEO background, because we do our own marketing and stuff like that and I have a fascination for SEO. But what really struck me the most about the Compact Keywords training program was how small I was looking at SEO and how much bigger it could be. Even though, you know, we've depended upon it for over the last decade. I'm like, man, I could have used this a long time ago had I seen or known about it.
And I'm just grateful now to have seen and to know about the Compact Keywords course by Edward Sturm. I highly recommend it. Even if you have an SEO background, there's always another level. And so I really encourage you, whether you have an SEO background or you don't. You don't even really need an SEO background to do it. And that's what I was kind of surprised by, too. Uh, we do use Edward's exact posting strategy for social media content as well. My oldest child and I have a jewelry brand that we're building together, and she's on top of it every single day using his stuff, and just adding in Compact Keywords is going to really pour fuel on the fire with that. We already use SEO and stuff, but it's just going to do a lot more, and it's a lot bigger than what I thought it was. And so, thank you so much, Edward, and the team.
So, if you want to save years learning SEO that brings customers, that brings users, that brings warm leads calling you up, that's at compactkeywords.com. And that is everything that I've got for you on this episode of the show. This is episode 1,032 of The Edward Show. 1,032 days in a row doing this podcast. If you watch us on YouTube, thank you so much for watching. If you listened on Spotify or Apple Podcasts, thank you so much for listening. And I will talk to you again tomorrow.