A pipeline in which the community nominates sources, high-reliability members co-author versioned extraction patterns with AI, and those patterns emit Provisional Facts at the source's citation tier into the review queue (G-060), where they are validated continuously by sampling and pattern reliability scoring. Every source is gated first by license classification. The pipeline runs alongside gamified manual chronicler entry.
Why this matters
Official providers (Entry 019) cover only part of the picture; the long tail of fandom knowledge lives on community sites. This generalizes the Memory Alpha wedge (G-030) into a repeatable "onboard almost any source" capability. It also closes a loop: community-ingested Facts land as Provisional and ride the same gamified review queue (G-060), so the chroniclers who contribute entries are also the validation workforce that scales with ingest.
Building an exhaustive source list (method)
"Exhaustive" is a process, not a static list — make the corpus generate it:
- Derive the list from the corpus. Every title in the CC0 metadata (Wikidata, MusicBrainz, OpenLibrary) carries a distributor/label/publisher field. Aggregate those across all titles → the distinct set, ranked by title count, is the exhaustive list, auto-prioritized by coverage unlocked.
- Seed from industry membership directories (near-complete distributor rosters): IFTA (indie film/TV), MPA (majors), RIAA/IFPI (music), AAP/IPA (books), ESA (games).
- Prioritize distributors over producers — one distributor carries many titles/labels (Banijay, Fremantle, ITV/BBC Studios; The Orchard, ADA, Believe; Ingram), so connecting a single distributor covers many catalogs at once.
- Coverage metric drives onboarding — track "% of titles whose distributor has a connected press source"; the uncovered slice surfaces the next targets, gamifiable as quests (G-059).
- Nomination fills the long tail (below); G-067 keeps each source's terms current.
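The derive-and-rank step and the coverage metric above can be sketched in a few lines. This is a minimal illustration, not a committed design: the `distributor` key, the function names, and the sample titles are all hypothetical, and real CC0 records would first need their distributor/label/publisher fields normalized into one value.

```python
from collections import Counter


def rank_distributors(titles):
    """Count titles per distributor, most titles first (ties broken by name)."""
    counts = Counter(t["distributor"] for t in titles if t.get("distributor"))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def coverage(titles, connected):
    """Fraction of titles whose distributor has a connected press source."""
    if not titles:
        return 0.0
    covered = sum(1 for t in titles if t.get("distributor") in connected)
    return covered / len(titles)


def next_targets(titles, connected, limit=3):
    """Unconnected distributors, ranked by how many titles they would unlock."""
    ranked = rank_distributors(titles)
    return [(name, n) for name, n in ranked if name not in connected][:limit]


# Hypothetical sample corpus; a title with no distributor counts as uncovered.
titles = [
    {"title": "A", "distributor": "Banijay"},
    {"title": "B", "distributor": "Banijay"},
    {"title": "C", "distributor": "Fremantle"},
    {"title": "D", "distributor": "Ingram"},
    {"title": "E", "distributor": None},
]
connected = {"Fremantle"}

print(rank_distributors(titles))
print(coverage(titles, connected))
print(next_targets(titles, connected))
```

With this sample, Banijay ranks first because it unlocks two titles, and it also heads the target list because only Fremantle is connected. Titles with no distributor stay in the denominator, so missing metadata shows up as lost coverage rather than disappearing from the metric.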
Proposed design
- Rights-holder / distributor as a registry entity — model the org that controls assets, not just the URL: media type, territories, catalog scope, press-portal URL, access type, terms, status. Distributors are the leverage points.
- Source Registry — every source (Wikidata, Fandom, nominated sites) is a first-class entity carrying: license class, citation tier (Entry 020), status (nominated → vetted → pattern-authored → active → drifted/quarantined), and a reliability score.
- Nomination + upvote queue — anyone may nominate a source; community upvotes prioritize which to pattern next. Cheap by design; abuse is bounded because the expensive step (pattern-authoring) is gated.
- License gate first — classify each source before anything ingests: ingest-as-Facts / reference-signal-only (the G-044 holding pattern) / reject. Tractable because facts are not copyrightable (extracting structured Facts is far safer than copying prose) and because we apply our own Entry 030 stance reciprocally: respect robots/ToS, attribute, don't take others' labor without credit.
- Pattern Studio (extends the Ingest Studio, Entry 021) — a high-reliability (G-059) + high-Power member co-authors an extraction pattern with AI from sampled pages. A pattern = selectors/schema map + entity-type mapping (G-048) + citation-tier assignment + confidence rules. It is versioned and attributed (Layer 9), drift-detected (G-041), and is itself the highest-tier chronicler contribution.
- Automated run → Provisional, never Canon — a pattern emits Facts at the source's citation tier into the review queue (G-060). Fan-tier sources produce Provisional Facts that must climb via confirmations; CC0/provider sources enter higher.
- Validate-as-it-scales — sample a percentage of each pattern's output and check it against gold (G-059); track the pattern's downstream revert/contradiction rate as a pattern reliability score (the mirror of reviewer reliability); auto-throttle or quarantine on rising error or detected drift.
- Two converging lanes — automated patterns (high-volume) and gamified manual chronicler entry (high-quality, the prose→Fact loop, G-057/G-059) both land as Provisional and ride the same validation.
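One way the license gate and the throttle/quarantine decision could fit together is sketched below. Everything here is an assumption made for illustration: the reliability math, the thresholds, and the minimum sample size are all listed under open decisions, so the smoothed success rate and the `min_samples`, `throttle_below`, and `quarantine_below` values are placeholders, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class LicenseClass(Enum):
    INGEST_AS_FACTS = "ingest-as-Facts"
    REFERENCE_SIGNAL_ONLY = "reference-signal-only"
    REJECT = "reject"


class Action(Enum):
    RUN = "run"
    THROTTLE = "throttle"
    QUARANTINE = "quarantine"


def may_ingest(license_class: LicenseClass) -> bool:
    """License gate: only sources classed ingest-as-Facts may run a pattern."""
    return license_class is LicenseClass.INGEST_AS_FACTS


@dataclass
class PatternStats:
    sampled: int = 0              # emitted Facts that were sampled and checked
    errors: int = 0               # sampled Facts later reverted or contradicted
    drift_detected: bool = False  # set by drift detection (G-041)

    @property
    def reliability(self) -> float:
        # Assumed formula: smoothed success rate, (successes + 1) / (sampled + 2),
        # so a pattern with no samples scores 0.5 rather than 0 or 1.
        return (self.sampled - self.errors + 1) / (self.sampled + 2)


def decide(stats: PatternStats, min_samples: int = 50,
           throttle_below: float = 0.90, quarantine_below: float = 0.75) -> Action:
    """Pick run/throttle/quarantine from drift and sampled reliability."""
    if stats.drift_detected:
        return Action.QUARANTINE
    if stats.sampled < min_samples:
        # Too little evidence yet: keep volume low until sampling catches up.
        return Action.THROTTLE
    if stats.reliability < quarantine_below:
        return Action.QUARANTINE
    if stats.reliability < throttle_below:
        return Action.THROTTLE
    return Action.RUN
```

For example, `decide(PatternStats(sampled=200, errors=4))` returns `Action.RUN` because the smoothed reliability is 197/202, about 0.975. Starting new patterns in the throttled state is a design choice of this sketch, not something the entry specifies.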
Open decisions
- License taxonomy and per-class handling (depends on G-044, G-002).
- Pattern-author permission threshold (reliability + Power floor).
- Pattern reliability math and quarantine triggers; sampling rate.
- Nomination anti-abuse, dedup, and prioritization mechanics.
- AI cost envelope for sample-pull + extraction (G-046).
- A written scraping-ethics policy consistent with Entry 030.
Related
- Entry 019 — provider-seeded cold-start (the official-source predecessor)
- Entry 021 — Ingest Studio (Pattern Studio extends it) · Entry 022 — AI role split
- Entry 020 — citation tiers · Entry 030 — crawler ethics applied reciprocally
- G-044 — CC-BY-SA derivative scope · G-002 — IP/copyright posture (license gate)
- G-046 — AI-proposed mappings cost · G-041 — mapping versioning & drift
- G-026 — data quality · G-059 — reliability spine · G-060 — review queue
- G-030 — Memory Alpha migration (the wedge this generalizes)