Gap Register
G-058 · Public

AI crawler & data-licensing posture

Tier 1 — Existential, blocks launch
Status
Open — not started
Owner
Creator
Why now
Once an answer engine has crawled the Fact graph for training, the asset is given away and can't be clawed back. The allow/block policy and the licensing tiers must exist before any meaningful crawler traffic arrives — and they directly gate both the AEO discoverability bet (G-055) and the data-licensing value of the graph (G-026, G-029).
Depends on
G-044
Related
Entry 030, G-055, G-026, G-029, G-044, G-005

Operationalize the "citable, not trainable" posture (Entry 030): a per-crawler-class allow/block policy, WAF-level enforcement behind robots.txt, and licensing tiers that keep OLN freely citable while holding the structured graph in reserve as the asset a partnership buys.

Why this matters

Entry 030 decides the posture — allow citation crawlers, block training crawlers, license the structured graph not the CC-BY-SA prose. What remains open is the operationalization, and it is time-sensitive: a training crawl, once it happens, can't be undone. Open decisions:

  • Per-class crawler policy — the concrete robots.txt and the maintained user-agent lists behind it. Allow live citation/retrieval (OAI-SearchBot, PerplexityBot, AI-Overview fetch) and classic search (Googlebot); block training (GPTBot, CCBot, Google-Extended) by default. Who owns the list as new bots appear, and what the default is for an unknown AI user-agent.
  • Enforcement beyond the honor system: robots.txt and Google-Extended are honored by the major labs but ignored by many scrapers, so "block" is only real with WAF-level enforcement (e.g. Cloudflare AI-bot rules). Decide the enforcement layer and how aggressive it is, weighed against false positives on the legitimate citation crawlers we want.
  • Licensing tiers — define exactly what is free vs. reserved. Free: live, attributed citation. Reserved (the partnership product): bulk export, real-time/API access, the normalized structured graph, and explicit training rights. Pricing/terms can wait; the boundary cannot.
  • CC-BY-SA boundary (depends on G-044) — the imported prose is share-alike and cannot be fenced; the licensable asset is strictly the structure OLN creates (Facts, normalized entities, relationships, provenance, freshness). The policy must be written so it does not depend on the still-open ingest question (G-044) being resolved first.
  • Timing / sequencing — seed citability now to build authority; hold the premium tier in reserve until the data set has the density that makes a partnership worth negotiating. Define the trigger for opening the reserved tier.
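
As a rough illustration of the per-class crawler policy above, the following Python sketch keeps one table of crawler tokens by class and renders a robots.txt from it, then checks the result with the standard library's robots.txt parser. The names CrawlerClass, CRAWLERS, CLASS_ALLOWED, render_robots_txt and is_allowed are hypothetical and exist only for this example. The token list is limited to the bots named in the bullet; "AI-Overview fetch" is left out because no specific token is given for it. No wildcard group is emitted, so the default for an unknown AI user-agent stays an open decision here, and nothing in this sketch enforces anything against scrapers that ignore robots.txt.

```python
from enum import Enum
from urllib.robotparser import RobotFileParser


class CrawlerClass(Enum):
    SEARCH = "classic search"
    CITATION = "live citation/retrieval"
    TRAINING = "training"


# Tokens taken from the policy bullet; the real list would be maintained
# by whoever ends up owning it (an open decision, not settled here).
CRAWLERS = {
    "Googlebot": CrawlerClass.SEARCH,
    "OAI-SearchBot": CrawlerClass.CITATION,
    "PerplexityBot": CrawlerClass.CITATION,
    "GPTBot": CrawlerClass.TRAINING,
    "CCBot": CrawlerClass.TRAINING,
    "Google-Extended": CrawlerClass.TRAINING,
}

# Per-class default from Entry 030: allow search and citation, block training.
CLASS_ALLOWED = {
    CrawlerClass.SEARCH: True,
    CrawlerClass.CITATION: True,
    CrawlerClass.TRAINING: False,
}


def render_robots_txt(crawlers=CRAWLERS, class_allowed=CLASS_ALLOWED):
    """Render one robots.txt group per known crawler token."""
    groups = []
    for token, crawler_class in crawlers.items():
        rule = "Allow: /" if class_allowed[crawler_class] else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, user_agent, path="/"):
    """Check a user-agent token against the rendered file with the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


ROBOTS_TXT = render_robots_txt()
```

Calling is_allowed(ROBOTS_TXT, "GPTBot") reports what a compliant crawler would conclude from the file. Keeping the class table as the single source means a new bot is one added row, and the same table could later feed the WAF rules once the enforcement layer is chosen.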

Related

  • Entry 030 — the deciding posture (citable, not trainable)
  • G-055 — discoverability moat: this policy is what makes "citable" real
  • G-026 — Fact-graph data quality: the structured asset being gated/licensed
  • G-029 — HellaThis intercompany licensing: precedent for licensing the graph
  • G-044 — Fandom CC-BY-SA derivative scope: bounds what is licensable at all
  • G-005 — AI policy: the contributor-attribution principle this posture extends