What On-Page Signals Actually Differentiate AI-Cited Pages

By Andrew Coffey · 2026-03-24

We analyzed the HTML of 18,129 pages cited by AI and 4,622 pages that rank well but don't get cited, extracted 213 structural signals from each, and measured what's different. This post covers what we found at the page level — no domain authority, no backlinks, just what's on the page. The short answer: the effects are small but real, and they're not what most people expect.

The 10 Scoring Composites, Ranked

Individual signals are noisy. A single signal like hasCanonical (d=0.086) has a small effect on its own. But when you group related signals into composites, combining hasCanonical with hasLangAttribute, hasDoctype, hasViewportMeta, and others, the consistent directional tendency adds up while the noise cancels out. Here are the 10 composites ranked by how strongly they differentiate cited from non-cited pages, using the volume-weighted cross-model weights (our production scoring set):

- Crawl & Index Signals, 14.8% weight (d=0.158): HTML fundamentals like canonical, lang, viewport, meta description, doctype.
- RAG Retrieval Suitability, 13.0% (d=0.140): semantic HTML, article tags, clean section structure, heading-to-content ratio.
- Content Relevance, 12.6% (d=0.136): word count, title length, paragraph structure, content depth.
- Engagement Cues, 12.5% (d=0.135): CTAs, social links, contact info, FAQ content, comment sections.
- Multimodal Readiness, 11.4% (d=0.123): images, alt text coverage, image-to-text ratio.
- AI Readability, 8.8% (d=0.095): heading hierarchy, chunk size, readability grade, short paragraphs.
- Citation Suitability, 7.2% (d=0.077): statistics, dates, author info, lexical diversity, numerical content.
- Structured Metadata, 6.7% (d=0.072): JSON-LD, Open Graph, Twitter Cards, schema types.
- Page Freshness, 6.7% (d=0.072): publication date, modification date, copyright year.
- Domain Expertise, 6.3% (d=0.068): author info, about links, organization schema.

The distribution is remarkably flat. The top composite carries 14.8% of the weight, the bottom 6.3%. No single dimension dominates. This is different from what you might expect if you've read advice saying schema markup or freshness is "the most important factor": the data says everything matters a little, nothing matters a lot, and the basics matter most.
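To make the composite idea concrete, here is a minimal sketch of how a signal's effect size and a weighted composite score can be computed. The function names, the five-signal grouping, and the equal weights are illustrative assumptions, not our production scoring code.

```python
import math
from statistics import mean, stdev

def cohens_d(cited_values, control_values):
    """Cohen's d: standardized difference between cited and control means, using a pooled SD."""
    n1, n2 = len(cited_values), len(control_values)
    pooled_sd = math.sqrt(((n1 - 1) * stdev(cited_values) ** 2 +
                           (n2 - 1) * stdev(control_values) ** 2) / (n1 + n2 - 2))
    return (mean(cited_values) - mean(control_values)) / pooled_sd

# Hypothetical composite: equal weights over five crawl/index signals, each scored 0.0-1.0.
CRAWL_INDEX_WEIGHTS = {
    "hasCanonical": 0.2, "hasLangAttribute": 0.2, "hasDoctype": 0.2,
    "hasViewportMeta": 0.2, "hasMetaDescription": 0.2,
}

def composite_score(page_signals, weights=CRAWL_INDEX_WEIGHTS):
    """Weighted sum of a page's signal values for one composite."""
    return sum(w * page_signals.get(name, 0.0) for name, w in weights.items())
```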

The Strongest Individual Signals

Filtering out behavioral signals, here are the top structural signals that differentiate cited from non-cited pages in JS-rendered mode:

- lexicalDiversity: d=0.268 (cited 0.345 vs control 0.302)
- shortParagraphPct: d=0.250 (59.4% vs 51.9%)
- hasTitle: d=0.239 (92.0% vs 84.6%)
- hasDoctype: d=0.234 (91.4% vs 83.9%)
- hasOpenGraph: d=0.219 (83.0% vs 74.2%)
- hasLangAttribute: d=0.217 (89.9% vs 82.6%)
- hasAlternateLink: d=0.215 (53.2% vs 42.6%)
- hasViewportMeta: d=0.208 (90.7% vs 84.0%)
- hasCallToAction: d=0.202 (83.2% vs 75.1%)
- hasMetaDescription: d=0.197 (82.1% vs 74.1%)

The pattern: cited pages are more likely to have basic HTML hygiene implemented correctly. The gaps aren't huge (92% vs 85% for having a title tag), but they're consistent across dozens of signals and they accumulate. lexicalDiversity at d=0.268 is the strongest non-behavioral signal in the entire dataset. Cited pages use more varied vocabulary (0.345 vs 0.302 on a 0-1 scale), which likely reflects content quality: pages written with care use more diverse language than pages churned out quickly.
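If you want to approximate the two text-shape signals on your own pages, the sketch below uses a type-token ratio for lexicalDiversity and a 60-word cutoff for shortParagraphPct. Both definitions, and the cutoff, are plausible proxies rather than the exact formulas behind the numbers above.

```python
import re

def lexical_diversity(text):
    """Type-token ratio: unique words divided by total words, on a 0-1 scale."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

def short_paragraph_pct(paragraphs, max_words=60):
    """Share of paragraphs at or under a word-count threshold (the 60-word cutoff is an assumption)."""
    if not paragraphs:
        return 0.0
    return sum(1 for p in paragraphs if len(p.split()) <= max_words) / len(paragraphs)
```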

What Signals Actually Hurt

Some signals that sound like good practice show consistent negative effects:

- sponsoredLinkCount: d=-0.189 (more sponsored/affiliate links = less likely cited)
- hasNoarchive: d=-0.198 (blocking archiving = less likely cited)
- externalLinkCount: d=-0.149 (more outbound links = less likely cited)
- hasFigcaptions: d=-0.163 (figure captions = less likely cited)
- maxSectionWordCount: d=-0.134 (longer maximum sections = less likely cited)
- wordCount: d=-0.091 (longer pages = slightly less likely cited)

hasFigcaptions being negative is counterintuitive: shouldn't image captions help? The likely explanation is that figcaptions are common on Wikipedia and academic pages, which are overrepresented in the control set. The sponsored link finding is more straightforward: pages heavy with affiliate content are measurably less likely to be cited by AI.
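If you want to measure these on a page, the sketch below pulls rough counts out of the HTML with BeautifulSoup. The detection heuristics (for example, treating rel="sponsored" anchors as the sponsored-link count) are my assumptions about reasonable detectors, not the extraction logic behind the numbers above.

```python
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def negative_signal_counts(html, page_url):
    """Rough counts for the signals that correlated negatively with citation."""
    soup = BeautifulSoup(html, "html.parser")
    host = urlparse(page_url).netloc
    links = soup.find_all("a", href=True)
    sponsored = sum(1 for a in links if "sponsored" in (a.get("rel") or []))
    external = sum(1 for a in links
                   if urlparse(a["href"]).netloc and urlparse(a["href"]).netloc != host)
    robots = soup.find("meta", attrs={"name": "robots"})
    noarchive = bool(robots and "noarchive" in (robots.get("content") or "").lower())
    return {
        "sponsoredLinkCount": sponsored,
        "externalLinkCount": external,
        "figcaptionCount": len(soup.find_all("figcaption")),
        "hasNoarchive": noarchive,
    }
```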

JS vs Non-JS: A Split Nobody Else Has Measured

Most AI crawlers (GPTBot, ClaudeBot, PerplexityBot) do not execute JavaScript; they see your static HTML. Gemini is the exception: Google's crawlers render JavaScript. We fetched every page in both modes and produced separate weight sets for each.

Stronger in JS mode (what Gemini sees): schema and metadata signals (hasOrganizationSchema, hasArticleSchema, hasBreadcrumbSchema). These are often injected by JavaScript frameworks and invisible in static HTML.

Stronger in non-JS mode (what most AI crawlers see): content structure signals (hasTableHeaders d=0.128, comparisonTableDetected d=0.119). These exist in the static HTML and stand out more when JavaScript-injected content is absent.

If your site relies heavily on JavaScript to render schema markup, most AI crawlers can't see it; only Gemini benefits. JS-mode composite effects are roughly 2x stronger than non-JS mode (top composite d=0.304 vs d=0.158). What to do: make sure your critical metadata and schema are in the static HTML, not injected by client-side JavaScript, and server-side render anything you want AI crawlers to see.
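A quick way to check whether your schema is JS-only is to fetch the page both ways and compare. The sketch below assumes the requests and Playwright packages are installed; it's a diagnostic you can run on a handful of URLs, not the crawl pipeline we used.

```python
import requests
from playwright.sync_api import sync_playwright

def fetch_static(url):
    """What non-rendering crawlers (GPTBot, ClaudeBot, PerplexityBot) see: raw HTML."""
    return requests.get(url, timeout=30).text

def fetch_rendered(url):
    """What a rendering crawler sees: HTML after client-side JavaScript has run."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

def schema_only_in_js(url):
    """True if JSON-LD appears only after rendering, i.e. invisible to most AI crawlers."""
    marker = "application/ld+json"
    return marker not in fetch_static(url) and marker in fetch_rendered(url)
```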

What the Princeton GEO Paper Got Right

The Princeton GEO paper (Aggarwal et al., KDD 2024) found in controlled experiments that adding statistics to pages improved visibility by 22-41% and adding quotations improved it by 28-37%. Our data points in the same direction: statisticsDetected shows d=0.088 (JS) and d=0.069 (non-JS), positive in both modes. The effect is much smaller than Princeton's for two reasons: we're measuring an observational correlation rather than an interventional effect, and most pages already have statistics (76% in our cited set), so the observational gap is compressed. The Princeton paper remains the strongest evidence that specific content changes cause more citations. Our data confirms the direction and adds the nuance that statistics are one of many factors, not the dominant one.
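A statisticsDetected-style check can be as simple as a regular expression over the page text. The pattern below (percentages, comma-formatted numbers, multipliers) is a guess at what counts as a statistic, not the detector used in the analysis.

```python
import re

# Matches "22%", "41 percent", "18,129", or "2x"; a rough stand-in for statisticsDetected.
STAT_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*(?:%|percent)|\b\d{1,3}(?:,\d{3})+\b|\b\d+(?:\.\d+)?x\b")

def statistics_detected(text):
    return bool(STAT_PATTERN.search(text))
```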

What To Actually Fix

Based on the data, here's the priority order for page-level optimization.

Fix first (highest impact, easiest):

- Add missing HTML fundamentals: doctype, lang attribute, canonical, viewport meta, meta description.
- Remove noarchive directives if present.
- Ensure title tags are present and descriptive.

Fix second (moderate impact):

- Write in shorter paragraphs. Cited pages average 59% short paragraphs vs 52% for controls.
- Increase vocabulary diversity; don't repeat the same phrases.
- Break up long sections. Cited pages have shorter maximum sections.
- Add Open Graph and Twitter Card metadata.

Fix third (smaller but real impact):

- Add structured data (JSON-LD), but make sure it's in the static HTML, not JS-only.
- Include statistics and numerical data in your content.
- Add publication and modification dates.
- Ensure images have alt text.

Don't bother with (no measurable effect or negative):

- Adding figure captions specifically for AI (negative correlation).
- Making pages longer for the sake of length (negative correlation).
- Heavy sponsored/affiliate link placement on pages you want cited (negative correlation).
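The "fix first" tier is easy to check automatically. Here's a minimal audit sketch using BeautifulSoup; the keys mirror the signal names above, but the detection logic is an approximation rather than our extraction code.

```python
from bs4 import BeautifulSoup

def audit_fundamentals(html):
    """Check the 'fix first' signals on a single page's HTML; every True is a pass."""
    soup = BeautifulSoup(html, "html.parser")
    robots = soup.find("meta", attrs={"name": "robots"})
    robots_content = (robots.get("content") or "").lower() if robots else ""
    return {
        "hasDoctype": html.lstrip().lower().startswith("<!doctype"),
        "hasLangAttribute": bool(soup.html and soup.html.get("lang")),
        "hasCanonical": bool(soup.find("link", rel="canonical")),
        "hasViewportMeta": bool(soup.find("meta", attrs={"name": "viewport"})),
        "hasMetaDescription": bool(soup.find("meta", attrs={"name": "description"})),
        "hasTitle": bool(soup.title and soup.title.string and soup.title.string.strip()),
        "noNoarchive": "noarchive" not in robots_content,
    }

# Hypothetical usage against a saved page:
# failures = [name for name, ok in audit_fundamentals(open("page.html").read()).items() if not ok]
```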