Harmonic centrality (HC) is a web graph metric that measures how well-connected a domain is, based on how many other sites can reach it and how short those link paths are. Common Crawl uses harmonic centrality to prioritize which domains get crawled, how frequently, and how many pages get fetched. Because 64% of the 47 LLMs analyzed in a Mozilla Foundation report trained on filtered Common Crawl data, a domain’s HC score influences its representation in AI training datasets.
Domains with high HC scores appear in more monthly crawl snapshots. Domains with low scores appear inconsistently or not at all. For brands trying to understand AI visibility, HC explains part of why some companies have structural advantages in getting mentioned by language models.
Key Takeaways:
- Harmonic centrality measures structural connectivity across the web graph
- Common Crawl has used HC for crawl prioritization since 2017
- Higher HC means more frequent crawling and greater training data representation
- HC is a diagnostic signal for understanding AI visibility, not an optimization lever
How Does Harmonic Centrality Affect AI Training Data?
The chain from HC to AI visibility is documented, not speculative:
1. HC determines crawl priority: Common Crawl’s internal database contains 25 billion URLs. Monthly crawls target approximately 3 billion. HC scores determine which domains make the cut and how many pages get fetched. According to the 2024 FAccT paper analyzing Common Crawl: “The higher the overall harmonic centrality score of a domain, the more likely it is to be included in a crawl, and the greater number of its pages are fetched.”
2. Crawl priority determines archive presence: Around 50% of URLs crawled monthly have been included in previous crawls. High-HC domains like Wikipedia appear in nearly every snapshot. Low-HC domains appear inconsistently or not at all. Well-connected domains accumulate representation across archives through repeated inclusion.
3. Archives feed training corpora:
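- GPT-3’s training data drew on 41 shards of monthly Common Crawl snapshots (2016-2019). After filtering, Common Crawl supplied over 80% of the tokens in the dataset, although it was sampled at 60% of the training mix.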
- RefinedWeb extracted 5 trillion tokens from Common Crawl for Falcon.
- A Mozilla Foundation report found 64% of 47 LLMs analyzed used filtered Common Crawl data.
When model builders combine multiple snapshots, domains with high HC have proportionally higher representation.
4. Training representation affects model familiarity: More exposure during training correlates with better recall. Greater representation establishes baseline familiarity without guaranteeing citation in any specific query. The model has encountered your content, your terminology, and your way of explaining things.
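To make the first two steps concrete, here is a minimal Python sketch. The URL counts come from the figures above. Everything else is an assumption for illustration: the domain names and scores are made up, and `allocate_budget` is a hypothetical proportional rule, not Common Crawl’s actual scheduler.

```python
# Figures quoted in the text above.
KNOWN_URLS = 25_000_000_000      # URLs in Common Crawl's database
MONTHLY_TARGET = 3_000_000_000   # URLs targeted per monthly crawl

# Only a fraction of known URLs can be fetched in any one crawl.
share = MONTHLY_TARGET / KNOWN_URLS
print(f"Share of known URLs fetched per crawl: {share:.0%}")


def allocate_budget(hc_scores, total_pages):
    """Hypothetical rule: split a page budget in proportion to HC score."""
    total_score = sum(hc_scores.values())
    return {
        domain: int(total_pages * score / total_score)
        for domain, score in hc_scores.items()
    }


# Made-up domains and scores, purely for illustration.
hc_scores = {
    "encyclopedia.example": 6.0,
    "trade-pub.example": 3.0,
    "small-brand.example": 1.0,
}

per_crawl = allocate_budget(hc_scores, total_pages=10_000)

# A model builder who combines 12 monthly snapshots multiplies the gap.
# (Overlap between snapshots and deduplication are not modeled here.)
combined = {domain: pages * 12 for domain, pages in per_crawl.items()}

for domain, pages in combined.items():
    print(f"{domain}: {per_crawl[domain]} pages per crawl, {pages} across 12 snapshots")
```

Under this toy rule the gap between domains is fixed per crawl, but the absolute difference in pages grows with every snapshot added to a training corpus. That is the accumulation effect described in step 2.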
What Makes Harmonic Centrality Different From PageRank?
Most marketers know PageRank. Harmonic centrality is different in ways that matter for understanding web graph signals.
| Aspect | PageRank | Harmonic Centrality |
| --- | --- | --- |
| What it measures | Importance inherited from linking pages | Structural reachability across the graph |
| How it works | Authority flows from authoritative linkers | Every domain that can reach the target adds 1/distance, measured along its shortest link path |
| Spam resistance | Lower (link farms can inflate scores) | Higher (requires genuine structural connectivity) |
| Common Crawl use | Secondary ranking | Primary crawl prioritization signal |
Common Crawl explicitly chose HC because it’s “more robust against link spam than PageRank.” This matters for AI search optimization (AISO) because it means the metric rewards genuine web presence over manufactured link signals.
In Common Crawl’s 2025 data, Wikipedia ranks #14 in HC but #37 in PageRank. The metrics capture different things, and Common Crawl uses HC as the primary driver for crawl decisions.
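For readers who want to see the calculation itself, the sketch below computes harmonic centrality on a tiny made-up link graph using only the Python standard library. The domain names are hypothetical, and the scores are raw sums. Published rankings may normalize or scale them differently.

```python
from collections import deque


def harmonic_centrality(links):
    """Score each domain by summing 1/distance over every domain that can reach it.

    `links` maps a source domain to the list of domains it links to.
    """
    # Collect every domain, including ones that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Reverse the graph so a search from `target` walks links backwards.
    incoming = {node: [] for node in nodes}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    scores = {}
    for target in nodes:
        # Breadth-first search gives the shortest link path from each
        # domain that can reach `target`.
        distance = {target: 0}
        queue = deque([target])
        while queue:
            current = queue.popleft()
            for previous in incoming[current]:
                if previous not in distance:
                    distance[previous] = distance[current] + 1
                    queue.append(previous)
        # Closer domains contribute more; unreachable domains contribute nothing.
        scores[target] = sum(1 / d for node, d in distance.items() if node != target)
    return scores


# Hypothetical toy graph: shop -> blog -> news -> hub, plus blog -> hub.
toy_links = {
    "blog.example": ["news.example", "hub.example"],
    "news.example": ["hub.example"],
    "shop.example": ["blog.example"],
}

scores = harmonic_centrality(toy_links)
for domain, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{domain}: {score}")
```

In this toy graph, `hub.example` scores highest because three domains can reach it, two of them in a single hop. `shop.example` scores zero because nothing links to it, no matter how many links it sends out.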
What Does This Mean for Your Brand’s AI Visibility?
HC is a diagnostic signal, not an optimization target
You can check your domain’s HC rank through Common Crawl’s web graph data. A low rank suggests your domain occupies a peripheral position in the web graph, which means less frequent crawling and less representation in training data.
HC reflects your domain’s actual structural position across the entire web. CDN domains (Cloudflare, gstatic) rank extremely high because they’re embedded across millions of sites. News organizations and major platforms rank high because they’re referenced constantly. These positions reflect genuine web infrastructure accumulated over years.
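If you download a ranks file from Common Crawl’s web graph releases, a lookup can be sketched as below. The layout is an assumption to verify against the release you use: tab-separated columns, a header row whose names start with `#`, and a `#host_rev` column that stores names in reversed form (`com.example`). The sample rows are invented for the demo.

```python
import csv


def find_domain_rank(lines, domain):
    """Return the rank row for `domain` as a dict, or None if it is absent.

    `lines` is any iterable of tab-separated text lines, such as an open file.
    Assumed layout: a header row with '#'-prefixed names, including '#host_rev'.
    """
    # Assumed convention: example.org is stored as org.example.
    reversed_name = ".".join(reversed(domain.lower().split(".")))

    reader = csv.reader(lines, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]

    # Scan row by row so a large file never has to fit in memory.
    for row in reader:
        record = dict(zip(header, row))
        if record.get("host_rev") == reversed_name:
            return record
    return None


# Invented sample rows that follow the assumed layout.
sample = [
    "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev",
    "1\t3.0E7\t2\t0.01\tcom.example-cdn",
    "2\t2.9E7\t5\t0.005\torg.example",
]

result = find_domain_rank(sample, "example.org")
print(result)
```

With a real download, pass an open file instead of the sample list, for example `gzip.open(path, "rt", encoding="utf-8")` if the file is gzip-compressed. A domain that returns `None` does not appear in that release’s ranking at all.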
The brands with “built-in” AI visibility earned it structurally
When marketers notice that certain brands get mentioned by AI systems more readily, part of the explanation is structural. Those domains have high HC scores, which means more Common Crawl coverage, which means more training data representation, which means more model familiarity.
Real-time retrieval, freshness signals, and content quality also matter. Baseline familiarity from training data is one factor among several, and HC influences that baseline.
Other pathways to AI visibility exist
AI systems increasingly use retrieval-augmented generation (RAG), pulling fresh content at query time rather than relying solely on training data. Perplexity, ChatGPT with browsing, and Google’s AI Overviews all incorporate live retrieval. These systems can surface content whether or not it appeared in training data.
The correct framing: low HC reduces the probability of appearing in training data, which may reduce baseline familiarity, without precluding visibility through retrieval-based pathways.
What Can You Actually Do About This?
What works (slowly)
Genuine link acquisition from well-connected domains improves your structural position over time. This is not link building in the traditional SEO sense. It means earning references from sites that are themselves well-connected, such as industry publications, major platforms, and widely used tools.
This is a long game measured in years, not months. The web graph reflects accumulated structural relationships, not recent activity.
What doesn’t work
- Buying links doesn’t improve HC meaningfully (spam resistance is a design goal)
- Creating microsites or link networks adds noise, not structural connectivity
- Rapid-fire content production doesn’t affect graph position
What to focus on instead
For most brands, AISO efforts are better spent on:
- Content that answers questions clearly (improves retrieval performance)
- Consistent entity references (helps AI systems recognize your brand)
- Presence on platforms AI systems actively retrieve from
- Structured data that helps AI understand your content
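As one example of the structured data and entity consistency items above, the sketch below builds schema.org `Organization` markup as JSON-LD. The brand name and URLs are placeholders, `organization_jsonld` is a hypothetical helper, and how much weight any given AI system places on this markup is not something this example can show.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build schema.org Organization markup as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs lists other official profiles of the same entity.
        "sameAs": list(same_as),
    }


# Placeholder brand details for illustration only.
markup = organization_jsonld(
    name="Example Brand",
    url="https://www.example.com",
    same_as=[
        "https://social.example/example-brand",
        "https://directory.example/companies/example-brand",
    ],
)

# Serialize for embedding in a page.
print(json.dumps(markup, indent=2))
```

The printed JSON is typically embedded in a page inside a `<script type="application/ld+json">` tag. Using the same name and profile URLs everywhere is what keeps entity references consistent.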
HC is worth understanding because it explains part of the AI visibility landscape. For most organizations, it’s a diagnostic insight rather than an action item.
Harmonic centrality is a property of web infrastructure rather than a strategic lever. Understanding it helps set realistic expectations about AI search optimization: some visibility advantages are baked into web structure and accumulate over years. For most brands, the actionable work lies in content quality, entity consistency, and presence on platforms where AI systems retrieve information.
Harmonic centrality explains the structural advantage some domains carry into AI systems. However, structural position is only part of the picture. What ultimately matters is how often your brand is retrieved, cited, and reused across real prompts.
Book a call with ReSO to see your AI visibility in action, where your brand shows up across ChatGPT, Perplexity, and AI Overviews, which prompts trigger you, and where citation gaps are limiting your presence.
Frequently Asked Questions
1. How can I check my domain’s harmonic centrality score?
Common Crawl publishes web graph data, including HC rankings. Third-party tools like the CC Rank Checker provide an easier lookup. Scores are normalized to a 0-10 scale, with higher numbers indicating better structural connectivity.
2. Does Common Crawl cover the entire web?
No. Common Crawl explicitly rejects this framing. Their crawl engineer stated: “Often it is claimed that Common Crawl contains the entire web, but that’s absolutely not true.” Monthly crawls target 3 billion URLs from a database of 25 billion. Major platforms like Facebook block Common Crawl’s bot entirely. Coverage is substantial but partial.
3. If my HC is low, will AI never mention my brand?
Low HC reduces training data representation but doesn’t prevent AI visibility. Real-time retrieval systems (Perplexity, ChatGPT with browsing) can surface content regardless of training data presence. Brand mentions across the web, structured data, and content quality all contribute independently of HC. The metric is one factor among many.
4. Is harmonic centrality the same thing as domain authority?
No. Domain authority (as defined by Moz and similar tools) is a proprietary score predicting search ranking ability. Harmonic centrality is a graph theory metric measuring structural connectivity. They may correlate loosely, since well-connected domains tend to rank well, but they measure different things and come from different sources.



