Gap Register
G-066 · Public

Community source nomination & pattern-authored ingest

Tier 2 — Structurally thin, not launch-blocking
Status
Open — design drafted
Owner
Creator
Why now
Cold-start corpus growth needs more than official providers (Entry 019). Letting the community nominate arbitrary sources and having trusted members co-author reusable extraction patterns with AI could scale the corpus fast — but ingesting third-party sites is a legal and quality minefield that requires a gated pipeline before any of it is safe.
Depends on
G-044, G-046
Related
Entry 019, Entry 021, Entry 022, Entry 020, Entry 030, G-044, G-002, G-046, G-041, G-026, G-059, G-060, G-030

A pipeline where the community nominates sources, high-reliability members co-author versioned extraction patterns with AI, and those patterns emit Provisional Facts at the source's citation tier into the review queue (G-060) — validated continuously by sampling + pattern reliability. Gated first by license classification. Runs alongside gamified manual chronicler entry.

Why this matters

Official providers (Entry 019) only cover so much; the long tail of fandom knowledge lives on community sites. This generalizes the Memory Alpha wedge (G-030) into a repeatable "onboard almost any source" capability — and closes a loop: community-ingested Facts land as Provisional and ride the same gamified review queue (G-060), so chroniclers are simultaneously the scaling validation workforce.

Building an exhaustive source list (method)

"Exhaustive" is a process, not a static list — make the corpus generate it:

  1. Derive the list from the corpus. Every title in the CC0 metadata (Wikidata, MusicBrainz, OpenLibrary) carries a distributor/label/publisher field. Aggregate those across all titles → the distinct set, ranked by title count, is the exhaustive list, auto-prioritized by coverage unlocked.
  2. Seed from industry membership directories (near-complete distributor rosters): IFTA (indie film/TV), MPA (majors), RIAA/IFPI (music), AAP/IPA (books), ESA (games).
  3. Prioritize distributors over producers — one distributor carries many titles/labels (Banijay, Fremantle, ITV/BBC Studios; The Orchard, ADA, Believe; Ingram), so connecting it covers catalogs at once.
  4. Coverage metric drives onboarding — track "% of titles whose distributor has a connected press source"; the uncovered slice surfaces the next targets, gamifiable as quests (G-059).
  5. Nomination fills the long tail (below); G-067 keeps each source's terms current.
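Steps 1 and 4 can be sketched in a few lines. This is a minimal illustration, not a committed design: the record shape, the `distributor` field name, and the helper names `rank_distributors`, `coverage`, and `next_targets` are all assumptions made for the example, and the sample titles are invented.

```python
from collections import Counter

# Hypothetical title records; real CC0 metadata would need normalizing first.
titles = [
    {"title": "Show A", "distributor": "Banijay"},
    {"title": "Show B", "distributor": "Banijay"},
    {"title": "Show C", "distributor": "Fremantle"},
    {"title": "Album D", "distributor": "The Orchard"},
    {"title": "Book E", "distributor": None},  # metadata gap
]


def rank_distributors(titles):
    """Return (distributor, title_count) pairs, most titles first."""
    counts = Counter(t["distributor"] for t in titles if t.get("distributor"))
    return counts.most_common()


def coverage(titles, connected):
    """Fraction of titles whose distributor has a connected press source."""
    if not titles:
        return 0.0
    covered = sum(1 for t in titles if t.get("distributor") in connected)
    return covered / len(titles)


def next_targets(titles, connected, limit=3):
    """Unconnected distributors, ranked by how many titles they would unlock."""
    ranked = rank_distributors(titles)
    return [(d, n) for d, n in ranked if d not in connected][:limit]


ranking = rank_distributors(titles)
connected = {"Banijay"}  # assumed: distributors with a live press source
cov = coverage(titles, connected)
targets = next_targets(titles, connected)
```

Titles with no distributor count against coverage in this sketch, so metadata gaps show up in the same metric as unconnected distributors. Whether that is the right accounting is a design choice left open here.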

Proposed design

  • Rights-holder / distributor as a registry entity — model the org that controls assets, not just the URL: media type, territories, catalog scope, press-portal URL, access type, terms, status. Distributors are the leverage points.
  • Source Registry — every source (Wikidata, Fandom, nominated sites) is a first-class entity carrying: license class, citation tier (Entry 020), status (nominated → vetted → pattern-authored → active → drifted/quarantined), and a reliability score.
  • Nomination + upvote queue — anyone may nominate a source; community upvotes prioritize which to pattern next. Cheap by design; abuse is bounded because the expensive step (pattern-authoring) is gated.
  • License gate first — classify each source before anything ingests: ingest-as-Facts / reference-signal-only (the G-044 holding pattern) / reject. Tractable because facts are not copyrightable — extracting structured Facts is far safer than copying prose — and because we apply our own Entry 030 stance reciprocally: respect robots/ToS, attribute, don't take others' labor without credit.
  • Pattern Studio (extends the Ingest Studio, Entry 021) — a high-reliability (G-059) + high-Power member co-authors an extraction pattern with AI from sampled pages. A pattern = selectors/schema map + entity-type mapping (G-048) + citation-tier assignment + confidence rules. It is versioned and attributed (Layer 9), drift-detected (G-041), and is itself the highest-tier chronicler contribution.
  • Automated run → Provisional, never Canon — a pattern emits Facts at the source's citation tier into the review queue (G-060). Fan-tier sources produce Provisional Facts that must climb via confirmations; CC0/provider sources enter at a higher tier.
  • Validate-as-it-scales — sample a percentage of each pattern's emitted Facts and check the sample against gold (G-059); track each pattern's downstream revert/contradiction rate as a pattern reliability score (the mirror of reviewer reliability); auto-throttle or quarantine on rising error or detected drift.
  • Two converging lanes — automated patterns (high-volume) and gamified manual chronicler entry (high-quality, the prose→Fact loop, G-057/G-059) both land as Provisional and ride the same validation.
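The Source Registry lifecycle and the license gate can be sketched as a small state machine. The status names and the three license classes come from the bullets above. Everything else is an assumption for illustration: the field names, the `fan` tier label, the recovery edges out of drifted/quarantined, and the rule that a reference-signal-only source stops at vetted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LicenseClass(Enum):
    INGEST_AS_FACTS = "ingest-as-facts"
    REFERENCE_SIGNAL_ONLY = "reference-signal-only"
    REJECT = "reject"


class Status(Enum):
    NOMINATED = "nominated"
    VETTED = "vetted"
    PATTERN_AUTHORED = "pattern-authored"
    ACTIVE = "active"
    DRIFTED = "drifted"
    QUARANTINED = "quarantined"


# Forward path from the design; the recovery edges are assumptions.
ALLOWED = {
    Status.NOMINATED: {Status.VETTED},
    Status.VETTED: {Status.PATTERN_AUTHORED},
    Status.PATTERN_AUTHORED: {Status.ACTIVE},
    Status.ACTIVE: {Status.DRIFTED, Status.QUARANTINED},
    Status.DRIFTED: {Status.PATTERN_AUTHORED, Status.QUARANTINED},
    Status.QUARANTINED: {Status.PATTERN_AUTHORED},
}


@dataclass
class Source:
    name: str
    citation_tier: str  # tier labels are defined in Entry 020; "fan" is a placeholder
    license_class: Optional[LicenseClass] = None  # None until classified
    status: Status = Status.NOMINATED
    upvotes: int = 0
    reliability: float = 0.0

    def advance(self, new_status: Status) -> None:
        """Move to new_status if the lifecycle and the license gate both allow it."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        # License gate first: nothing moves until the source is classified.
        if self.license_class is None:
            raise PermissionError("source has no license class yet")
        if self.license_class is LicenseClass.REJECT:
            raise PermissionError("rejected sources do not advance")
        # Assumed rule: reference-signal-only sources can be vetted but never patterned.
        if (self.license_class is LicenseClass.REFERENCE_SIGNAL_ONLY
                and new_status is not Status.VETTED):
            raise PermissionError("reference-signal-only sources stop at vetted")
        self.status = new_status


wiki = Source(name="example-fan-wiki", citation_tier="fan")
wiki.license_class = LicenseClass.INGEST_AS_FACTS
wiki.advance(Status.VETTED)
wiki.advance(Status.PATTERN_AUTHORED)
wiki.advance(Status.ACTIVE)
```

Putting the gate inside `advance` means no caller can activate an unclassified source by accident, which matches the "before anything ingests" ordering.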

Open decisions

  • License taxonomy and per-class handling (depends on G-044, G-002).
  • Pattern-author permission threshold (reliability + Power floor).
  • Pattern reliability math and quarantine triggers; sampling rate.
  • Nomination anti-abuse, dedup, and prioritization mechanics.
  • AI cost envelope for sample-pull + extraction (G-046).
  • A written scraping-ethics policy consistent with Entry 030.
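To make the reliability and sampling decisions concrete, here is one possible shape for the math. It is a placeholder to argue against, not a proposal: the thresholds, the minimum-evidence floor, the "worse of two rates" rule, and the choice to quarantine immediately on drift are all assumptions, as are the names `PatternStats`, `pick_sample`, `error_rate`, and `decide`.

```python
import random
from dataclasses import dataclass

# Hypothetical tuning values; the real ones are among the open decisions.
SAMPLE_RATE = 0.05     # share of emitted Facts checked against gold
THROTTLE_AT = 0.05     # error rate at which a pattern is slowed down
QUARANTINE_AT = 0.15   # error rate at which a pattern is pulled
MIN_SAMPLES = 20       # below this, there is too little evidence to act


@dataclass
class PatternStats:
    emitted: int = 0          # Facts the pattern has produced
    sampled: int = 0          # Facts checked against gold
    sample_failures: int = 0  # sampled Facts that failed the gold check
    reverted: int = 0         # Facts later reverted or contradicted


def pick_sample(facts, rate, rng):
    """Select each Fact independently with probability `rate`."""
    return [f for f in facts if rng.random() < rate]


def error_rate(stats):
    """Worse of the gold-sample failure rate and the downstream revert rate."""
    sample_err = stats.sample_failures / stats.sampled if stats.sampled else 0.0
    revert_err = stats.reverted / stats.emitted if stats.emitted else 0.0
    return max(sample_err, revert_err)


def reliability(stats):
    """Pattern reliability score in [0, 1]; higher is better."""
    return 1.0 - error_rate(stats)


def decide(stats, drift_detected=False):
    """Return 'active', 'throttle', or 'quarantine' for a pattern."""
    if drift_detected:
        return "quarantine"  # assumed: drift alone is enough to pull a pattern
    if stats.sampled < MIN_SAMPLES:
        return "active"      # not enough evidence to judge yet
    err = error_rate(stats)
    if err >= QUARANTINE_AT:
        return "quarantine"
    if err >= THROTTLE_AT:
        return "throttle"
    return "active"


healthy = PatternStats(emitted=1000, sampled=50, sample_failures=1, reverted=10)
noisy = PatternStats(emitted=1000, sampled=50, sample_failures=4, reverted=30)
broken = PatternStats(emitted=1000, sampled=50, sample_failures=10, reverted=30)
batch = pick_sample(list(range(1000)), SAMPLE_RATE, random.Random(0))
```

Taking the worse of the two rates is deliberately conservative. A weighted blend, or a confidence interval that accounts for small samples, would be equally defensible starting points.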

Related

  • Entry 019 — provider-seeded cold-start (the official-source predecessor)
  • Entry 021 — Ingest Studio (Pattern Studio extends it) · Entry 022 — AI role split
  • Entry 020 — citation tiers · Entry 030 — crawler ethics applied reciprocally
  • G-044 — CC-BY-SA derivative scope · G-002 — IP/copyright posture (license gate)
  • G-046 — AI-proposed mappings cost · G-041 — mapping versioning & drift
  • G-026 — data quality · G-059 — reliability spine · G-060 — review queue
  • G-030 — Memory Alpha migration (the wedge this generalizes)