The Indexably Method — Complete Findings With External Research Context
By Andrew Coffey · 2026-03-24
Full research documentation and methodology reference for the Indexably calibration system.
Study Design
500 prompts across 12 query categories sent to 5 AI surfaces: OpenAI (ChatGPT with web search), Anthropic (Claude with web search), Gemini, Perplexity, and Google AI Overviews. 25,115 total citations collected, 18,129 unique cited URLs across 8,352 unique domains. 4,622 control URLs drawn from Brave Search top-20 results for the same queries (pages that appeared in search but were never cited by any model). 22,787 pages fetched and analyzed for 213 on-page structural signals. 1,000 cited domains + 1,000 control domains analyzed via DataForSEO Backlinks and Domain Analytics APIs for domain-level authority signals. Combined page + domain analysis via logistic regression and quartile stratification. All comparisons use Cohen's d (standardized mean difference) between cited and control groups. No causal claims are made. All findings describe measured differences between cited and non-cited pages/domains, not proven causes of citation.
How This Compares to Other Published Research
We evaluated this study against 13 methodological criteria organized into four groups: core methodology (criteria 1-7: multi-model citation collection, large-scale HTML extraction of cited pages, matched control set, 100+ on-page signals, statistical effect sizes, published numerical weights, cross-category query coverage), domain-level analysis (criteria 8-9: domain authority data with effect sizes, combined page+domain multivariate analysis), interaction analysis (criteria 10-11: page × domain authority interaction, within-domain comparison), and rendering/cross-model analysis (criteria 12-13: dual JS/non-JS rendering, per-model weight derivation). We searched extensively across academic databases (arXiv, Google Scholar, ACM Digital Library, Semantic Scholar), industry publications, and research output from 19+ named entities. We did not find a published study that meets all 13 criteria simultaneously. Four criteria appear to be completely unaddressed across published research we reviewed: page × domain authority interaction analysis, within-domain comparison, dual JS/non-JS rendering analysis, and per-model weight derivation from a unified dataset. Closest studies: SE Ranking XGBoost + SHAP study (216,524 pages, 20 niches, ~4-10 of 13 criteria). Zhang et al. (arXiv, Dec 2025, 55,936 queries, 1,418,733 citation hyperlinks, 4 full + 5 partial of 13). Ziptie / Mike King (2026, 1M+ AI responses, ~8 of 13). Sellm (400K+ URLs, 70+ features, ~4 of 13). AirOps (12,000-548,534 pages, 15 industries, ~3-4 of 13). GEO-16 Framework (arXiv, Sep 2025, 1,702 citations, ~3 of 13). Princeton GEO (Aggarwal et al., KDD 2024, intervention study, different methodology entirely). Ahrefs (75K brands, off-page focus).
Research Methods
The calibration system runs an 8-phase pipeline.
Data Collection Pipeline
Phase 1 — Prompt generation: 500 prompts distributed across 12 query categories (Academic/Research, Current Events, E-commerce/Product, Entity/Brand Recognition, Factual/Statistics, Financial/Legal, How-To/Procedural, Local Discovery, Local Services, Medical/Health, Product Comparison, Technical/Developer). Every prompt includes at least one forcing element — a temporal anchor ("in 2026"), a price constraint ("under $500"), or a comparison structure ("X vs Y") — to ensure AI models trigger web search rather than answering from parametric memory.
Phase 2 — Citation collection: Each prompt is sent to all 5 AI surfaces with web search enabled. The system extracts every URL cited in each response. For Google AI Overviews, citations are collected via DataForSEO's SERP API. Each citation is tagged with the model that produced it, the prompt that triggered it, and the position within the response.
Phase 3 — Control set construction: For each prompt, Brave Search API returns the top 20 organic results for the same query. Results are deduplicated against already-collected control URLs and the full cited URL set. The system continues collecting until the control set matches the number of cited URLs, with a query cap of whichever is smaller: 1.5x the target control size or 750 queries. The remaining URLs form the control group — pages that rank well in traditional search for the same queries but were not selected by any AI surface. This matched-query design controls for topic relevance: cited and control pages are competing to answer the same questions.
Phase 4 — Page collection: Every cited and control URL is fetched in the same way AI crawlers access public web pages — respecting robots.txt and standard crawling conventions. Pages are collected in two modes: non-JS (static HTML as seen by most AI crawlers) and JS (fully rendered after JavaScript execution, representing what Gemini's crawler sees). The HTML from both modes is stored for signal extraction.
Phase 5 — Signal extraction: Each page's HTML is parsed with cheerio (a server-side HTML parser) to extract 213 distinct on-page signals across categories: content structure (word count, paragraph counts, heading hierarchy, list usage), metadata (JSON-LD, Open Graph, Twitter Cards, canonical, robots directives), content patterns (statistics detection, FAQ patterns, comparison tables, answer capsules), readability (Flesch-Kincaid grade, sentence length, lexical diversity), link analysis (internal/external counts, nofollow/sponsored/ugc breakdown), technical markup (schema types, script counts, semantic HTML usage), and freshness (publication date, modification date, copyright year).
Phase 6 — Domain signal collection (separate from page pipeline): The top 1,000 cited domains (by citation frequency) and 1,000 randomly sampled control domains are analyzed via two DataForSEO APIs. The Backlinks API returns referring domain counts, referring subnet/IP counts, Domain Rank (0-1000 scale), link attribute breakdowns (nofollow, sponsored, ugc), link platform types (news, blogs, wikis, e-commerce), link semantic locations (article body, header, footer, nav), and institutional TLD counts (.edu, .gov). The Domain Analytics API returns organic keyword counts by position range (1, 2-3, 4-10, through 91-100), estimated organic traffic value, and keyword trend data (new, lost, up, down). A third API — DataForSEO's OnPage API — was also called during the domain calibration. It crawls up to 20 pages per domain and returns technical SEO metrics: duplicate content, broken links, non-indexable pages, canonicalization status, SEO-friendly URL checks, and an overall onpage score. We ran it on all 2,000 domains and stored the results. However, the OnPage signals produced almost no meaningful differentiation between cited and control domains — the strongest signal was canonicalization status code at d=0.334, and most others fell below d=0.12. More importantly, the signals that did show moderate effects (internal/external link counts per page) were already captured more reliably by the Backlinks API at the domain level. None of the OnPage signals are included in the final domain-level metric composites or the combined analysis. The raw data is preserved for potential future use.
Phase 7 — Statistical analysis: Described in detail in the following sections.
Phase 8 — Combined analysis: Pages are joined to their domain's signals by URL-to-domain extraction. Logistic regression and quartile stratification are run on the joined dataset.
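To make the Phase 5 step concrete, here is a minimal sketch of cheerio-based signal extraction. It covers only a handful of the signals named in this document (hasDoctype, hasCanonical, lexicalDiversity, and so on); the production extractor covers 213 signals, and the exact logic shown here is illustrative rather than the actual implementation.

```typescript
import * as cheerio from "cheerio";

// Illustrative subset of the 213 on-page signals; names mirror signals
// referenced in this document, but the checks are simplified sketches.
interface PageSignals {
  hasDoctype: boolean;
  hasLangAttribute: boolean;
  hasCanonical: boolean;
  hasViewportMeta: boolean;
  hasMetaDescription: boolean;
  hasOpenGraph: boolean;
  wordCount: number;
  lexicalDiversity: number; // unique words / total words
}

export function extractSignals(html: string): PageSignals {
  const $ = cheerio.load(html);
  const bodyText = $("body").text();
  const words = bodyText.toLowerCase().split(/\s+/).filter(Boolean);
  const uniqueWords = new Set(words);

  return {
    hasDoctype: /^\s*<!doctype/i.test(html),
    hasLangAttribute: typeof $("html").attr("lang") === "string",
    hasCanonical: $('link[rel="canonical"]').length > 0,
    hasViewportMeta: $('meta[name="viewport"]').length > 0,
    hasMetaDescription: $('meta[name="description"]').length > 0,
    hasOpenGraph: $('meta[property^="og:"]').length > 0,
    wordCount: words.length,
    lexicalDiversity: words.length ? uniqueWords.size / words.length : 0,
  };
}
```

The same function can be run once on the non-JS HTML and once on the JS-rendered HTML to produce the two signal sets described above.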
How We Measure Effect Size: Cohen's d
Cohen's d is the primary statistical measure used throughout this study. It quantifies the standardized difference between two groups — in our case, cited pages/domains vs. control pages/domains. The formula: d = (mean_cited - mean_control) / pooled_standard_deviation. Where pooled standard deviation is: SD_pooled = sqrt(((n_cited - 1) * SD_cited² + (n_control - 1) * SD_control²) / (n_cited + n_control - 2)). Interpretation: A positive d means cited pages score higher on that signal. A negative d means control pages score higher. The magnitude indicates how large the difference is relative to natural variation in the data. Conventional thresholds: d=0.2 is "small," d=0.5 is "medium," d=0.8 is "large." Most of our page-level signals fall in the d=0.05-0.30 range (below or just at "small"). Our strongest domain-level signal (Rank, d=1.075) is "large." Why Cohen's d instead of other measures: Cohen's d is unitless and comparable across signals with different scales. You can directly compare hasDoctype (a 0/1 binary) against wordCount (ranging 0-50,000) because both are standardized by their own distributions. Spearman correlation (used by Ahrefs) measures rank-order association. Logistic regression coefficients (used in our combined analysis) measure predictive contribution. Each answers a slightly different question — Cohen's d answers "how different are these two groups?" Point-biserial correlation is also computed alongside Cohen's d. It measures the correlation between citation status (a binary: cited or not) and the signal value (continuous). The two metrics are mathematically related and tell the same story from different angles. We report both for completeness.
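For readers who want the computation spelled out, here is a minimal sketch of the Cohen's d formula above, written in TypeScript for illustration (this is not the calibration code itself):

```typescript
// Cohen's d between two groups of signal values, using the pooled SD
// formula given above: sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2)).
function cohensD(cited: number[], control: number[]): number {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
  };
  const n1 = cited.length;
  const n2 = control.length;
  const pooledSD = Math.sqrt(
    ((n1 - 1) * variance(cited) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
  );
  // Positive d: cited pages score higher on the signal; negative d: control pages do.
  return (mean(cited) - mean(control)) / pooledSD;
}
```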
How Composite Scores Work
Raw signals are too numerous and individually noisy to use directly for scoring. They are grouped into 10 page-level composites (crawlIndexSignals, contentRelevance, engagementCues, ragRetrievalSuitability, multimodalReadiness, structuredMetadata, aiReadability, pageFreshness, domainExpertise, citationSuitability) based on what dimension of citation readiness they measure. Each composite has a scoring function that takes the raw signals assigned to it and outputs a 0-100 score. For example, crawlIndexSignals combines hasCanonical, hasLangAttribute, hasDoctype, hasViewportMeta, hasMetaDescription, and related signals. A page with all of them scores high; a page missing most scores low. Cohen's d is then computed at the composite level — the 0-100 score for cited pages vs. the 0-100 score for control pages. This composite-level d is typically larger than individual signal d values because aggregation cancels random noise while preserving the consistent directional tendency (cited pages scoring slightly higher on many signals within the composite).
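As an illustration of the composite idea, here is a sketch of a crawlIndexSignals-style scorer. Giving each raw signal an equal share of the 0-100 score is an assumption made for the example; the real scoring functions are not published in this document.

```typescript
// Illustrative composite scorer: map a group of raw binary signals to one 0-100 score.
interface CrawlIndexInputs {
  hasCanonical: boolean;
  hasLangAttribute: boolean;
  hasDoctype: boolean;
  hasViewportMeta: boolean;
  hasMetaDescription: boolean;
}

function crawlIndexSignalsScore(s: CrawlIndexInputs): number {
  const checks = [
    s.hasCanonical,
    s.hasLangAttribute,
    s.hasDoctype,
    s.hasViewportMeta,
    s.hasMetaDescription,
  ];
  const passed = checks.filter(Boolean).length;
  return (passed / checks.length) * 100; // all present -> 100, none -> 0
}
```

Cohen's d is then computed on these 0-100 composite scores for cited vs. control pages, exactly as for any other signal.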
How Weights Are Derived
The weight assigned to each composite is its share of total predictive signal: weight_i = |d_i| / sum(|d_all|) × 100%. Where |d_i| is the absolute Cohen's d for composite i. A composite with d=0.30 gets twice the weight of one with d=0.15. This ensures composites that more strongly differentiate cited from control pages have more influence on the overall score. Floor enforcement: Composites with negative d (control pages scored higher) or negligible d (|d| < 0.1) are floored at 3% weight. This prevents them from receiving zero weight (which would break the normalization) while limiting their scoring influence. The deficit from floor adjustments is subtracted proportionally from the higher-weighted composites. Confidence classification: Each composite receives a confidence label based on its Cohen's d magnitude and sample size — "high" (d ≥ 0.2, n ≥ 500), "medium" (d ≥ 0.15, n ≥ 100), "low" (d ≥ 0.1), or "table_stakes" (d < 0.1). These labels indicate how much trust to place in each weight.
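A sketch of the weight derivation follows, including the 3% floor and the proportional redistribution of the resulting deficit. Treating floored composites as receiving exactly 3% is our reading of the description above.

```typescript
// Derive composite weights as shares of total |d|, with a 3% floor for
// negative or negligible composites and proportional redistribution.
function deriveWeights(dByComposite: Record<string, number>): Record<string, number> {
  const names = Object.keys(dByComposite);
  const totalAbsD = names.reduce((sum, k) => sum + Math.abs(dByComposite[k]), 0);

  // Raw weight: |d_i| / sum(|d|) * 100
  const weights: Record<string, number> = {};
  for (const k of names) weights[k] = (Math.abs(dByComposite[k]) / totalAbsD) * 100;

  // Floor enforcement: negative d or |d| < 0.1 composites are set to 3%.
  const FLOOR = 3;
  const floored = names.filter(k => dByComposite[k] < 0 || Math.abs(dByComposite[k]) < 0.1);
  const deficit = floored.reduce((sum, k) => sum + (FLOOR - weights[k]), 0);
  for (const k of floored) weights[k] = FLOOR;

  // Subtract the deficit proportionally from the higher-weighted composites.
  const rest = names.filter(k => !floored.includes(k));
  const restTotal = rest.reduce((sum, k) => sum + weights[k], 0);
  for (const k of rest) weights[k] -= deficit * (weights[k] / restTotal);

  return weights; // still sums to ~100
}
```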
Volume-Weighted Cross-Model Analysis
The pooled weight set treats all cited pages equally regardless of which model cited them. This means OpenAI (2,075 unique cited URLs) gets the same influence per page as Gemini (6,815 unique cited URLs), even though Gemini produces 3x more citation data. The volume-weighted method adjusts for this. Each model's contribution to the overall effect size is proportional to its cited page count. A signal that Gemini (6,815 pages) finds predictive carries more weight than one that only OpenAI (2,075 pages) finds predictive, because Gemini's larger sample gives it more statistical power. Both pooled and volume-weighted results are computed and stored. The volume-weighted set is used for production scoring because it produces more stable weights and avoids dilution from lower-volume models.
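One way to read the volume-weighted method is as a per-model effect size averaged with weights proportional to each model's unique cited page count. A sketch under that assumption:

```typescript
// Volume-weighted effect size across models: each model's d for a signal is
// weighted by its share of unique cited pages.
interface ModelResult {
  model: string;      // e.g. "gemini", "openai"
  citedPages: number; // unique cited URLs for this model
  d: number;          // Cohen's d for the signal on this model's slice
}

function volumeWeightedD(results: ModelResult[]): number {
  const totalPages = results.reduce((sum, r) => sum + r.citedPages, 0);
  return results.reduce((sum, r) => sum + r.d * (r.citedPages / totalPages), 0);
}
```

With the counts above, Gemini (6,815 pages) contributes roughly three times the weight of OpenAI (2,075 pages) to the overall estimate.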
How the Control Set Works
The control set answers a specific question: among pages that are topically relevant to the same queries, what makes cited pages different from non-cited pages? For each of the 500 prompts, Brave Search returns the top 20 organic results. Results are deduplicated against already-collected control URLs and the cited URL set, and collection continues until the control set size matches the cited set (capped at 1.5x the target or 750 queries, whichever is smaller). This means control pages are not random web pages — they are pages that rank well in traditional search for the same queries. They are topically relevant, generally well-maintained, and often from reputable domains. The comparison is between "good pages that AI cited" and "good pages that AI didn't cite," not between cited pages and random internet garbage. Excluded domains: Wikipedia, YouTube, Reddit, and Google.com are excluded from the cited set before weight computation. These domains are structurally anomalous — Wikipedia lacks typical commercial page structures, YouTube pages are video wrappers, Reddit is user-generated threads, and Google.com results are navigation/SERP pages. Including them would contaminate the signal analysis. They remain in per-signal diagnostics and in the raw data for transparency. Sample sizes: 18,129 unique cited pages vs. 4,622 control pages for non-JS analysis. 6,814 cited vs. 4,682 control for JS analysis. The cited-to-control imbalance (roughly 4:1 for non-JS) exists because the 18,129 cited URLs exceeded the 750 query cap — the system couldn't collect enough unique control URLs from 750 Brave queries to match 18,129 cited URLs after deduplication. This is acceptable for Cohen's d computation — the standard error at these sample sizes is SE(d) ≈ 0.012, giving tight confidence intervals.
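A sketch of the control-collection loop under the rules described above. The braveTop20 helper is hypothetical and stands in for the Brave Search API call; the sketch makes a single pass over the prompts, whereas the real pipeline keeps issuing queries up to the cap.

```typescript
// Collect control URLs: pages that rank in Brave's top 20 for the same
// queries but were never cited by any AI surface.
async function collectControls(
  prompts: string[],
  citedUrls: Set<string>,
  targetSize: number,
  braveTop20: (query: string) => Promise<string[]> // hypothetical API wrapper
): Promise<Set<string>> {
  const controls = new Set<string>();
  // Query cap: whichever is smaller, 1.5x the target control size or 750 queries.
  const queryCap = Math.min(Math.ceil(targetSize * 1.5), 750);
  let queriesUsed = 0;

  for (const prompt of prompts) {
    if (controls.size >= targetSize || queriesUsed >= queryCap) break;
    queriesUsed++;
    for (const url of await braveTop20(prompt)) {
      // Keep only pages that ranked in search but were never cited.
      if (!citedUrls.has(url)) controls.add(url);
      if (controls.size >= targetSize) break;
    }
  }
  return controls;
}
```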
Domain-Level Composite Scoring
Domain signals have wildly different scales (Rank is 0-1000, Referring Domains can be 0-900,000, Spam Score is 0-100). To combine them into composites, each signal is percentile-normalized: for each domain, its raw value is converted to a percentile rank within the full distribution (cited + control combined). A domain at the 75th percentile for referring domains gets a score of 75, regardless of whether the raw number is 12,000 or 120,000. Signals where lower values indicate better quality (like spam scores) are inverted: score = 100 - percentile. The percentile-normalized scores within each composite are averaged to produce a 0-100 composite score per domain. Cohen's d is then computed on these composite scores between cited and control domains.
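A sketch of the percentile normalization, including the inversion used for lower-is-better signals such as spam score:

```typescript
// Convert raw domain-signal values to 0-100 percentile ranks within the
// combined cited + control distribution. Set invert = true for signals
// where lower raw values indicate better quality (e.g. spam score).
function percentileScores(values: number[], invert = false): number[] {
  const sorted = [...values].sort((a, b) => a - b);
  return values.map(v => {
    // Fraction of domains with a value at or below this one, scaled to 0-100.
    const atOrBelow = sorted.filter(x => x <= v).length;
    const pct = (atOrBelow / values.length) * 100;
    return invert ? 100 - pct : pct;
  });
}

// A domain composite is then the average of its percentile-normalized signals.
```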
Combined Analysis: Logistic Regression
The logistic regression predicts citation status (1 = cited, 0 = control) from 14 features: 10 page-level composite scores + 4 domain-level composite scores. Aggregation to domain level: Because multiple pages share the same domain (and therefore identical domain features), running the regression at page level would duplicate domain features and inflate the sample without adding information. Instead, the regression operates at the domain level: for each domain, page composite scores are averaged across all its pages, producing one row per domain with 10 averaged page features + 4 domain features + a citation label. Mixed domain exclusion: 510 domains have both cited and control pages. These are excluded from the regression because their citation label is ambiguous. They are analyzed separately in the within-domain comparison. Standardization: All 14 features are standardized to mean=0, std=1 before training. This ensures features with larger numeric ranges don't dominate coefficients regardless of predictive power. Training: Gradient descent on a sigmoid function. Learning rate 0.001, maximum 5,000 iterations. No regularization (14 features with ~1,500 observations doesn't overfit). Feature importance: Because features are standardized, the absolute value of each coefficient directly indicates relative importance. Importance percentage = |coefficient_i| / sum(|all coefficients|) × 100. McFadden pseudo R²: Compares the fitted model's log-likelihood to a null model that predicts the base rate for every observation. R² = 1 - (LL_model / LL_null). McFadden R² values are structurally lower than OLS R² — a McFadden R² of 0.2-0.4 is generally considered excellent fit. Values should not be interpreted as "percentage of variance explained." Variance decomposition: Three regressions are run — domain features only, page features only, and all 14 combined. Comparing the R² values shows how much each group contributes independently: domain unique = combined R² - page-only R², page unique = combined R² - domain-only R², shared = domain R² + page R² - combined R².
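A compact sketch of the regression setup follows: standardization, plain gradient descent with the stated hyperparameters (learning rate 0.001, 5,000 iterations), and McFadden pseudo R². This is an illustration of the method, not the calibration code itself.

```typescript
function sigmoid(z: number): number {
  return 1 / (1 + Math.exp(-z));
}

// Standardize each feature column to mean 0, std 1.
function standardize(X: number[][]): number[][] {
  const k = X[0].length;
  const cols = Array.from({ length: k }, (_, j) => X.map(row => row[j]));
  const means = cols.map(c => c.reduce((a, b) => a + b, 0) / c.length);
  const stds = cols.map((c, j) =>
    Math.sqrt(c.reduce((a, b) => a + (b - means[j]) ** 2, 0) / c.length)
  );
  return X.map(row => row.map((v, j) => (v - means[j]) / (stds[j] || 1)));
}

function fitLogistic(X: number[][], y: number[], lr = 0.001, iters = 5000) {
  const Xs = standardize(X);
  const n = Xs.length;
  const k = Xs[0].length;
  let bias = 0;
  const w = new Array(k).fill(0);

  // Gradient descent on the logistic loss.
  for (let it = 0; it < iters; it++) {
    const gradW = new Array(k).fill(0);
    let gradB = 0;
    for (let i = 0; i < n; i++) {
      const p = sigmoid(bias + Xs[i].reduce((s, x, j) => s + x * w[j], 0));
      const err = p - y[i];
      for (let j = 0; j < k; j++) gradW[j] += err * Xs[i][j];
      gradB += err;
    }
    for (let j = 0; j < k; j++) w[j] -= (lr * gradW[j]) / n;
    bias -= (lr * gradB) / n;
  }

  // McFadden pseudo R²: compare model log-likelihood to a base-rate null model.
  const logLik = (probs: number[]) =>
    probs.reduce((s, p, i) => s + (y[i] ? Math.log(p) : Math.log(1 - p)), 0);
  const fitted = Xs.map(row => sigmoid(bias + row.reduce((s, x, j) => s + x * w[j], 0)));
  const baseRate = y.reduce((a, b) => a + b, 0) / n;
  const mcFaddenR2 = 1 - logLik(fitted) / logLik(y.map(() => baseRate));

  // Because features are standardized, |w[j]| / sum(|w|) gives importance shares.
  return { weights: w, bias, mcFaddenR2 };
}
```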
Combined Analysis: Quartile Stratification
All joined pages are split into 4 equal groups by their domain's raw Rank value. Within each quartile, Cohen's d is computed on the 10 page-level composites between cited and control pages. This shows whether page optimization matters equally across all authority levels, or whether it only matters at certain levels.
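A sketch of the stratification step, reusing a cohensD helper like the one sketched earlier; field names are illustrative.

```typescript
// Split joined pages into four groups by their domain's Rank, then compute
// Cohen's d per page composite within each quartile.
interface JoinedPage {
  domainRank: number;                 // raw DataForSEO Rank (0-1000)
  cited: boolean;
  composites: Record<string, number>; // the 10 page-level 0-100 scores
}

function quartileEffects(
  pages: JoinedPage[],
  cohensD: (a: number[], b: number[]) => number
) {
  const sorted = [...pages].sort((a, b) => a.domainRank - b.domainRank);
  const quartileSize = Math.ceil(sorted.length / 4);
  const metrics = Object.keys(sorted[0].composites);

  return [0, 1, 2, 3].map(q => {
    const slice = sorted.slice(q * quartileSize, (q + 1) * quartileSize);
    const effects: Record<string, number> = {};
    for (const m of metrics) {
      const cited = slice.filter(p => p.cited).map(p => p.composites[m]);
      const control = slice.filter(p => !p.cited).map(p => p.composites[m]);
      effects[m] = cohensD(cited, control);
    }
    return { quartile: q + 1, effects };
  });
}
```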
Combined Analysis: Within-Domain Comparison
Domains that have both cited and control pages in the dataset are identified. For these domains, every domain-level factor is identical between cited and control pages — they're on the same domain. Any measured difference is purely attributable to page-level factors. Cohen's d is computed on the 10 page metrics for only these pages. This is the cleanest possible test of whether on-page signals have independent predictive value.
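The selection step is simple enough to show directly; a sketch with illustrative field names:

```typescript
// Keep only pages from domains that contributed both cited and control pages,
// so every domain-level factor is identical by construction. Cohen's d is then
// computed on the page composites for this subset only.
function mixedDomainSubset<T extends { domain: string; cited: boolean }>(pages: T[]): T[] {
  const citedDomains = new Set(pages.filter(p => p.cited).map(p => p.domain));
  const controlDomains = new Set(pages.filter(p => !p.cited).map(p => p.domain));
  return pages.filter(p => citedDomains.has(p.domain) && controlDomains.has(p.domain));
}
```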
How AI Models Cite Differently
Each model produces citations at different rates and favors different types of sources. This finding aligns with the Search Atlas study (5.5M responses, Dec 2025) which found citation behavior is platform-specific, and with Ahrefs' finding that only 7 of the top 50 most-mentioned domains appeared across all three major AI surfaces (ChatGPT, Google AI Overviews, Perplexity) — a staggering 86% were unique to each assistant. An important note on our citation rates: Nectiv (October 2025) found that only about 31% of typical ChatGPT prompts trigger a web search at all. Our rates are much higher because we deliberately designed every prompt to force web search — each one includes a temporal anchor ("in 2026"), a price constraint, or a comparison structure. Writesonic's March 2026 study confirmed that prompts with these elements trigger search 100% of the time. This was intentional — we needed models to actually retrieve pages from the web to measure what gets cited. With that context: Anthropic 14.32 avg citations per prompt, 99% citation rate. Gemini 13.92 avg, 98.2% rate. Google AIO 8.76 avg, 80.8% rate (20% of prompts produce no AI Overview at all). Perplexity 7.62 avg, 100% rate (never fails to cite). OpenAI 5.62 avg, 96% rate. Profound (June 2025) found that users' opening questions trigger web searches but follow-ups rarely do — if you want to be cited, you need to win the first question. Each model has distinct source preferences: OpenAI favors news sources (apnews.com, axios.com) and reference sites (wikipedia, arxiv). Anthropic favors medical/academic sources (pmc.ncbi.nlm.nih.gov, pubmed) and government (.gov) sites. Gemini and Perplexity both heavily favor YouTube — Perplexity cited youtube.com 209 times, more than any other single source across any model. Google AIO favors academic sources (pmc.ncbi.nlm.nih.gov at 126 citations) and platforms (medium.com, reddit.com). This is consistent with Ahrefs' findings that Google AIO leans on its own ecosystem (YouTube) plus user-generated content (Reddit, Quora), while ChatGPT favors publishers and media partnerships. Profound (June 2025) found Wikipedia is the most cited source in ChatGPT (7.8%), followed by Reddit (1.8%), Forbes (1.1%), and G2 (1.1%). Models weight page-level factors differently: OpenAI uniquely prioritizes pageFreshness (21.6%) and structuredMetadata (20.1%) — no other model puts either above 10%. Anthropic spreads weight evenly — its top metric is ragRetrievalSuitability at 14.7%, and 7 of 10 metrics are between 11-15%. Gemini produces the strongest effect sizes overall — crawlIndexSignals at d=0.304, the highest individual composite effect in the entire study. Perplexity closely resembles the cross-model average. Google AIO is the most balanced model — its top metric is 12.8% and its bottom is 6.7%, the flattest distribution of any model. The Writesonic industry citation study (Nov 2025) found that core content types account for 45-50% of citations and show only 2-5x variation across industries — format matters more than industry. ChatGPT maintains the most consistent citation behavior across industries (44.6% CV), while Claude is the most variable (147.9% CV). Our per-model weight divergence confirms this: each model has distinct preferences, but the structural formats that attract citations are relatively consistent across them.
Page-Level Signals That Differentiate Cited Pages
All effect sizes below are "small" by conventional standards (d < 0.5). The strongest non-behavioral signal in the entire dataset is lexicalDiversity at d=0.268 (JS mode). Most individual signals are in the d=0.05-0.15 range. Strongest individual page signals (JS mode): lexicalDiversity d=0.268, shortParagraphPct d=0.250, shortestSentenceWordCount d=0.240, titlePresent/hasTitle d=0.239, hasDoctype d=0.234, hasOpenGraph d=0.219, hasLangAttribute d=0.217, hasAlternateLink d=0.215, hasViewportMeta d=0.208, hasCallToAction d=0.202. Strongest individual page signals (non-JS mode): lexicalDiversity d=0.201, sponsoredLinkCount d=-0.150 (inverse — cited pages have fewer sponsored links), shortestSentenceWordCount d=0.146, hasDoctype d=0.140, titlePresent/hasTitle d=0.137, hasLangAttribute d=0.132, hasAlternateLink d=0.130, hasNoarchive d=-0.129 (inverse — cited pages less likely to block archiving). External research alignment: The Princeton GEO paper found statistics addition improved visibility by 22-41%. Our statisticsDetected signal shows d=0.088 (JS) and d=0.069 (non-JS) — positive but modest. The magnitude difference is explained by methodology: Princeton measured the effect of adding statistics to pages that lacked them (intervention study), while we measured the observational correlation between having statistics and being cited. An intervention effect of +40% can manifest as a small correlational d when most pages already have statistics (76% prevalence in our cited pages). The GEO-16 framework (1,702 citations) found metadata/freshness, semantic HTML, and structured data most strongly predict citation. Our data confirms this pattern — crawlIndexSignals (which includes canonical, lang, viewport, meta description) is the #1 weighted composite across every model and weight set. Growth Memo / Kevin Indig (Feb 2026) found 44.2% of LLM citations come from the first 30% of text. Our hasAnswerCapsule signal (d=0.014 JS, d=0.016 non-JS) is positive but small, consistent with the finding that front-loaded content helps. Signals that are consistently negative (control pages score higher): sponsoredLinkCount d=-0.189 JS, d=-0.150 non-JS — cited pages have fewer sponsored links. hasNoarchive d=-0.198 JS, d=-0.129 non-JS — cited pages less likely to block archiving. hasFigcaptions d=-0.163 JS, d=-0.087 non-JS — cited pages less likely to have figure captions (likely driven by Wikipedia/academic pages in control set). externalLinkCount d=-0.149 JS, d=-0.088 non-JS — cited pages have fewer external links. maxSectionWordCount d=-0.134 JS, d=-0.106 non-JS — cited pages have shorter maximum sections. wordCount d=-0.091 JS, d=-0.042 non-JS — cited pages are not longer; if anything, slightly shorter. This contradicts some external findings. SE Ranking (November 2025) reported pages over 2,900 words average 5.1 citations vs 3.2 for under 800 words. Our data shows a slight inverse relationship. The difference may be because their finding measures citation frequency (how many times a page is cited across all prompts) while our measurement compares cited vs non-cited pages (whether a page gets cited at all). The overall pattern: The signals are not "more content = more citations." It's "well-structured, cleanly marked-up pages with proper HTML fundamentals and moderate content length." The strongest differentiators are structural hygiene (doctype, lang attribute, viewport meta, canonical) and content quality indicators (lexical diversity, short paragraphs) rather than content volume. JS vs. 
non-JS divergence — a finding unique to this study: No other published study distinguishes between what a page looks like to a JS-rendering model (Gemini) vs. a static-HTML model (everyone else). Table-related signals (hasTableHeaders, comparisonTableDetected) are strong in non-JS but flat/negative in JS. Schema and metadata signals are stronger in JS mode. This is relevant because an estimated 69% of AI crawlers cannot execute JavaScript (Vercel, 2025).
How Signals Aggregate Into Composites
Individual signals are noisy and small. Composites aggregate them and produce more reliable effect sizes. Page-level composite weights (volume-weighted non-JS — the production weight set): Rank 1: crawlIndexSignals 14.8% (d=0.158, "Can AI find it?"). Rank 2: ragRetrievalSuitability 13.0% (d=0.140, "Will AI cite it?"). Rank 3: contentRelevance 12.6% (d=0.136, "Can AI read it?"). Rank 4: engagementCues 12.5% (d=0.135, "Will AI cite it?"). Rank 5: multimodalReadiness 11.4% (d=0.123, "Can AI read it?"). Rank 6: aiReadability 8.8% (d=0.095, "Can AI read it?"). Rank 7: citationSuitability 7.2% (d=0.077, "Will AI cite it?"). Rank 8: structuredMetadata 6.7% (d=0.072, "Can AI find it?"). Rank 9: pageFreshness 6.7% (d=0.072, "Will AI cite it?"). Rank 10: domainExpertise 6.3% (d=0.068, "Can AI find it?"). Key observations: crawlIndexSignals is #1 across every weight set. Google's John Mueller (November 2025) said "you don't need to create bot-only Markdown or JSON clones — clean HTML works just fine." Our data confirms this — the fundamentals of being findable by AI (proper HTML structure, canonical tags, meta descriptions, lang attributes) consistently differentiate cited from non-cited pages. The weight distribution is relatively flat — the gap between #1 (14.8%) and #10 (6.3%) is much smaller than a typical best-practice checklist would suggest. No single dimension dominates. This is worth noting because many tools in this space emphasize one or two factors as primary drivers — the data says it's more distributed than that. JS mode produces stronger effects across the board (top composite d=0.304 vs d=0.158 for non-JS). Pages that rely on JavaScript rendering show larger differences between cited and control, likely because JS-dependent pages have more variance in render quality.
Domain-Level Signals
Domain-level effect sizes are dramatically larger than page-level effects. Core authority signals (Backlinks API): Rank (0-1000 authority score) d=1.075 (cited mean 431, control mean 277), Referring Subnets d=0.513 (cited 14,700, control 1,800), Referring IPs d=0.404 (cited 33,100, control 3,300), Crawled Pages d=0.334 (cited 345,500, control 53,300), Referring Main Domains d=0.242 (cited 83,100, control 6,600), Referring Domains d=0.235 (cited 96,500, control 7,400), Total Backlinks d=0.098 (cited 48.7M, control 467K). DataForSEO Rank at d=1.075 is the single strongest signal in the entire calibration system. This is consistent with the Ahrefs 75K brands study finding brand authority as the strongest predictor of AI visibility. However, while Ahrefs measured brand mentions (Spearman ρ=0.664 with AI Overview presence), we measured domain authority metrics against actual page-level citation. Different measurement, same conclusion: domain-level authority dominates. Referring subnets (d=0.513) is stronger than referring domains (d=0.235). This is a genuinely novel finding not present in other published research. SE Ranking (November 2025) found that sites with over 32,000 referring domains are 3.5x more likely to be cited by ChatGPT — but they didn't distinguish between raw domain count and network diversity. Our data suggests AI systems (or the search indexes they rely on) respond to link diversity (unique network blocks) more than link volume. A domain with backlinks from 14,700 different subnets represents genuine independent endorsement from across the internet. Total backlinks is the weakest authority signal at d=0.098. This challenges the traditional SEO emphasis on raw backlink count. Quality and diversity metrics (subnets, IPs, unique referring domains) matter far more than volume. Ahrefs' own research showed web mentions (ρ=0.664) correlate much more strongly than backlinks (ρ=0.218), and our data tells the same story from a different angle. Search visibility signals (Domain Analytics API): Organic Keywords pos 91-100 d=0.281 (cited 24,700, control 2,300), Organic Keywords pos 81-90 d=0.267 (cited 41,700, control 3,900), Estimated Paid Traffic Cost d=0.261 (cited $44.3M, control $1.7M), Organic Keyword Count d=0.190 (cited 1.0M, control 65K), Organic Traffic Value (ETV) d=0.168 (cited $19.4M, control $1.2M), Organic Keywords pos 1 d=0.139 (cited 28,000, control 1,200). Long-tail keyword rankings (positions 91-100, d=0.281) are more predictive than top rankings (position 1, d=0.139). This measures domain breadth. Ahrefs (August 2025) found that 80% of LLM citations don't even rank in Google's top 100 for the original query — and only 12% of URLs cited by ChatGPT, Perplexity, and Copilot rank in Google's top 10. Our finding adds to this picture: what matters isn't whether a domain ranks #1 for specific queries, but whether it has massive topical coverage across thousands of keywords. A domain ranking for 24,700 keywords even at positions 91-100 has demonstrated comprehensive authority that AI systems recognize.
Domain Authority vs Page Optimization — The Combined Analysis
1,488 domains matched between page and domain datasets (490 cited, 998 control, 510 mixed domains excluded from regression). Logistic regression feature importance (14 features, domain-level aggregation): Domain factors account for 77.2% of the model's predictive power. Page factors account for 22.9%. Top features by importance: linkQuality 24.3% (domain), searchVisibility 23.1% (domain), domainAuthority 19.4% (domain), domainScale 10.4% (domain), domainExpertise 5.4% (page — but negative coefficient), crawlIndexSignals 3.4% (page). This ratio is broadly consistent with what external research has suggested. The Ahrefs study found the top 3 correlations with AI visibility were all off-site factors: brand web mentions (ρ=0.664), branded anchors (ρ=0.527), and branded search volume (ρ=0.392). SE Ranking (November 2025) found domains with millions of brand mentions on Quora and Reddit have roughly 4x higher chances of being cited, and domains with profiles on platforms like Trustpilot, G2, Capterra, and Yelp have 3x higher chances. An influential March 2026 essay ("GEO is a Racket") argued that 89% of AI citations come from earned media. Our data gives a more precise quantification from a unified dataset: roughly 77/23 domain vs page. Variance decomposition (McFadden pseudo R²): Domain-only model R²=0.097. Page-only model R²=0.001. Combined model R²=0.100. Shared overlap between domain and page: ~0%. NOTE: McFadden R² cannot be interpreted as "percentage of variance explained" in the OLS sense. What the decomposition reliably tells us: domain factors contribute roughly 100x more to the model's predictive improvement than page factors, and the near-zero shared overlap means domain and page factors measure genuinely independent dimensions — backlink profiles have nothing to do with HTML structure. Quartile stratification — the most actionable finding: Pages split into 4 groups by Domain Rank score. Q1 (lowest authority, Rank 0-400): 1,749 cited, 1,085 control, 61.7% citation rate. Q2 (Rank 400-526): 2,485 cited, 596 control, 79.0%. Q3 (Rank 526-611): 1,976 cited, 696 control, 75.9%. Q4 (highest authority, Rank 611-964): 2,142 cited, 618 control, 77.6%. Page-level effects by quartile (averaged across 10 metrics): Q1 near zero (avg d ≈ 0.0). Q2 slightly negative (avg d ≈ -0.07). Q3 consistently negative (avg d ≈ -0.17). Q4 consistently positive (avg d ≈ +0.33). Page optimization only produces measurable positive effects among the highest-authority domains (Q4). In Q1-Q3, page-level differences between cited and control pages are flat or slightly inverse. Page optimization compounds with domain authority rather than substituting for it. No other published study has measured this interaction. Ahrefs and Princeton GEO and GEO-16 have separately shown that domain authority matters and that on-page factors matter — our quartile analysis is the first published measurement of how they interact — and the answer is that on-page optimization provides lift on top of authority, not independently of it. Within-domain comparison (510 mixed domains, 272 qualifying): When domain authority is held perfectly constant (same domain, different pages): crawlIndexSignals d=0.103 (low confidence) — the only metric above table_stakes. All other metrics d < 0.08 (table_stakes). This confirms that on-page structural differences are real but very small when domain authority is removed as a variable.
What External Research Says That We Can Now Validate Or Challenge
Statistics addition improves visibility by 22-41% (Princeton GEO, KDD 2024) — our data: statisticsDetected d=0.088 JS, d=0.069 non-JS — positive, consistent direction. Magnitude difference explained by intervention vs. observational methodology.
Brand web mentions are the strongest predictor, ρ=0.664 (Ahrefs 75K brands study) — our data: DataForSEO Rank d=1.075, confirms domain authority massively outweighs on-page signals. Our data measures authority metrics rather than mentions, but the conclusion converges.
Pages with FCP under 0.4s get 3x more citations (SE Ranking, 2025) — our data: responseTimeMs d≈0.00, no measurable effect. However, we measured server response time, not browser FCP. Different measurement, can't confirm or deny.
Citation patterns are platform-specific (Search Atlas, 5.5M responses, Dec 2025) — confirmed. Per-model weights diverge meaningfully. OpenAI favors freshness/metadata, Gemini favors crawlability, Google AIO is most balanced.
44.2% of citations come from the first 30% of text (Growth Memo / Kevin Indig, Feb 2026) — hasAnswerCapsule d=0.014-0.016, positive but very small. Content positioning helps modestly.
Sites with 32K+ referring domains are 3.5x more likely to be cited (SE Ranking, Nov 2025) — referring domains d=0.235, cited mean 96,500 vs control 7,400. Direction confirmed, though our effect size measure is different from their odds ratio.
Pages over 2,900 words average 60% more citations (SE Ranking, November 2025) — wordCount d=-0.042 to -0.091, slightly inverse. Cited pages are NOT longer. This may measure different things: their finding was about citation frequency per page, ours is about whether a page gets cited at all.
Core content types account for 45-50% of citations (Writesonic, Nov 2025) — not directly measured at signal level, but our 12 query categories span all content types and the composite weights are stable across them.
Only 11% of domains are cited by both ChatGPT and Perplexity (Ahrefs, 2025) — not directly measured at domain level in this analysis, but the per-model weight divergence and the different top-cited domain lists confirm platform specificity.
50-90% of LLM citations aren't fully supported by cited sources (Wu et al., Nature Communications, 2025) — not measurable in our framework; we measure what gets cited, not whether the citation is accurate. Important caveat for honest messaging.
Domains with G2/Trustpilot/Capterra profiles have 3x citation odds (SE Ranking, Nov 2025) — not directly measured (we don't check for review platform profiles). Consistent with our domain authority findings — having review profiles is a proxy for brand establishment.
40-60% of cited sources rotate monthly (Semrush, 2025) — not testable with a single-run dataset. Would require repeated calibration runs to measure temporal stability.
What We Don't Know
Query-topic matching. The combined model (domain + page factors) produces a McFadden R² of 0.10. The majority of what determines citation is not captured by either structural page signals or domain authority metrics. The most likely driver is whether the page's content matches the specific question being asked — a content relevance and topic coverage question, not a structural one. But "most likely" is our reasoning, not something the analysis measured. Temporal stability. Our latest calibration used different prompts, different methodology, and different signal wiring than earlier calibrations. We cannot isolate whether weight differences between calibration runs reflect genuine changes in AI behavior or methodology changes. Semrush (2025) found 40-60% of cited sources rotate monthly, suggesting rapid change. Stability testing requires running the same methodology twice. Causation. All findings are correlational. We measured what cited pages look like compared to non-cited pages. We did not test whether changing a specific signal causes more citations. The Princeton GEO paper is the closest thing to causal evidence (intervention + measurement), and their findings are consistent with ours in direction if not magnitude. Local and vertical-specific patterns. The calibration aggregates all 12 query categories. Citation patterns for local service queries may differ from academic or technical queries. The current data cannot answer whether LocalBusiness schema matters specifically for local queries. AI system internals. We observe outputs (which pages get cited) and measure correlates. We don't know how any of these AI systems actually select sources. ChatGPT is known to use Bing as a backend (matching Bing's top 10 results 87% of the time per external research), while Perplexity uses its own retrieval system. The mechanisms differ, but we only see the results. The "GEO is a Racket" challenge. An influential March 2026 essay argued that 89% of AI citations come from earned media, meaning the driver is essentially traditional PR, not page optimization. Our data partially supports this: domain authority (built through earned media, links, and brand mentions) accounts for 77% of the model's predictive power. But the 23% attributed to page factors — and especially the Q4 quartile finding showing d=0.33 page-level effects among high-authority domains — suggests page optimization is not zero-value. It's secondary to authority, but it's measurable and real for domains that have authority.
Statements We Can Make With Confidence
1. Across 5 major AI surfaces and 500 diverse prompts, pages that get cited have measurably different structural characteristics than pages that don't — but the effect sizes are small (d=0.05-0.30 for individual page signals).
2. Domain authority is a much stronger differentiator of AI-cited vs non-cited content than on-page structural signals. The logistic regression assigns 77% of predictive importance to domain factors and 23% to page factors.
3. Among high-authority domains, page optimization produces meaningful positive effects (d=0.24-0.37 per metric in Q4). Among lower-authority domains, page signals do not measurably differentiate cited from non-cited pages. No other published study has quantified this interaction.
4. The most consistent page-level differentiators are structural fundamentals: proper HTML doctype, language attributes, viewport meta, canonical tags, and meta descriptions. These make up crawlIndexSignals, the #1 weighted composite across every model and weight set. This is consistent with the GEO-16 framework's finding that metadata and structural HTML most strongly predict citation.
5. Content quality signals (lexical diversity, paragraph structure) are real but secondary to structural fundamentals.
6. Backlink diversity (unique subnets d=0.513, unique IPs d=0.404) is more predictive of citation than raw backlink volume (total backlinks d=0.098). This is a novel finding not present in other published research, which has focused on referring domain counts.
7. Each AI model has distinct citation preferences. This confirms the Search Atlas finding and adds specific per-model weight breakdowns that no other study has published.
8. Based on our extensive search of academic and industry sources, we did not find a published study that combines all 13 methodological elements present in this work: multi-model citation collection, large-scale HTML extraction, matched control set, 100+ on-page signals, statistical effect sizes, published numerical weights, cross-category queries, domain-level authority data, combined page+domain multivariate analysis, page × domain authority interaction, within-domain comparison, dual JS/non-JS rendering, and per-model weight derivation. Several studies cover subsets of these elements — SE Ranking, Zhang et al., Ziptie, and others each contribute valuable pieces. If we've overlooked a study that covers this ground, we'd welcome the correction.
What This Means For People Trying to Get Cited by AI
The single biggest factor is domain authority — built over years through publishing quality content, earning links and mentions, and establishing brand reputation. Ahrefs' data, SE Ranking's data, and our data all converge on this conclusion from different angles. There is no shortcut. For established domains (high authority), page-level optimization provides measurable additional lift. The most impactful actions are structural: ensure proper HTML fundamentals (doctype, lang attribute, canonical, viewport meta, meta description), use varied vocabulary, write in shorter paragraphs, and avoid blocking AI access (no noarchive, no restrictive robots directives). The Princeton GEO paper showed that adding statistics and quotations improves visibility by 22-41% — our data confirms these signals differentiate cited from non-cited pages, even if the observational effect sizes are smaller than the experimental ones. For newer or smaller domains, the priority is building topical authority through comprehensive, consistent content. The data shows page optimization doesn't produce measurable citation differences until domain authority reaches a threshold (roughly Q4, or DataForSEO Rank above ~611). SE Ranking found that domains with profiles on review platforms like G2, Trustpilot, and Yelp have 3x higher citation chances — this kind of brand-building activity is the path for smaller domains. For everyone: the largest factor in whether AI cites your content is whether it matches what someone asked. The structural and authority signals measured in this study are necessary conditions, not sufficient ones. The best-optimized page on the most authoritative domain won't get cited if nobody asks a question it answers. And even when a page is cited, Wu et al. (Nature Communications, 2025) found that 50-90% of AI citations don't fully support the claims they're attached to — being cited is not the same as being cited accurately.
The Broader Landscape
The GEO/AEO space has grown rapidly — over 200 tools exist as of early 2026, each approaching AI visibility from a different angle. Several have published valuable research that informed our understanding: Ahrefs published Spearman correlations across 75,000 brands, demonstrating the dominance of off-site brand signals in AI visibility. Their Brand Radar tool tracks citation patterns at massive scale. SE Ranking published SHAP-based feature importance rankings from an XGBoost model across 216,000+ pages, providing one of the first statistical weight sets for AI citation factors. AirOps published a cited-vs-non-cited comparison across 15 industries, one of the first studies with a proper control set design. Sellm published a 5-factor importance breakdown from ML classification of 400K+ ChatGPT-retrieved pages, offering one of the few page-vs-domain decompositions. Ziptie / Mike King analyzed 1M+ AI responses and found that 88% of AI-cited URLs don't rank in Google's top 10, demonstrating the divergence between traditional SEO and AI citation. Each of these studies covers different ground, uses different methods, and answers different questions. Our contribution is combining several elements — multi-platform citation collection, large-scale signal extraction with matched controls, domain-level authority data, combined analysis, and per-model weights — into a single methodology. We encourage anyone interested to read the studies above alongside ours.