P-025: Distributor / source harvest (standing job) — OLN Roadmap

Where this runs

This job runs in the batch/data-pipeline layer (Motherbrain, which already pulls Wikidata), not in the web app. That placement follows the AI/ML role split (Entry 022: offline batch), the Ingest Studio (Entry 021), and the storage architecture (Entry 023). Motherbrain is external to this repo, so this document is the process spec for that team.

What this job ships

A recurring job implementing the exhaustive-source method (G-066):

Derive the distributor/label/publisher set from the CC0 corpus, with a title count per org.
Rank by title count → coverage-weighted priority list.
Seed the top of the list from industry directories (IFTA, MPA, RIAA/IFPI, AAP/IPA, ESA).
Write to the Source/Distributor Registry (G-066) as rights-holder entities (media type, territories, catalog scope, press-portal URL, access type, terms, status), deduped against existing rows.
Emit a coverage metric — % of titles whose distributor has a connected press source — to drive onboarding and gamified quests (G-059).
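
The counting, ranking, and coverage steps above can be sketched as follows. This is a minimal illustration, not Motherbrain's implementation: the TitleRecord shape, the function names, and the sample IDs are all hypothetical, and it assumes a title counts as covered when at least one of its rights-holders has a connected press source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TitleRecord:
    """One (title, rights-holder) pair from the corpus. Hypothetical shape."""
    title_id: str
    org_id: str


def rank_orgs(records):
    """Return [(org_id, title_count), ...] sorted by count, largest first."""
    # A set drops repeated (title, org) pairs so each title counts once per org.
    pairs = {(r.title_id, r.org_id) for r in records}
    counts = Counter(org for _, org in pairs)
    # Tie-break on org id so the ordering is stable between runs.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def coverage_pct(records, connected_orgs):
    """Percent of titles with at least one org that has a connected press source."""
    titles = {r.title_id for r in records}
    if not titles:
        return 0.0
    covered = {r.title_id for r in records if r.org_id in connected_orgs}
    return 100.0 * len(covered) / len(titles)


# Made-up sample data: four titles, three rights-holders.
records = [
    TitleRecord("T1", "OrgA"),
    TitleRecord("T2", "OrgA"),
    TitleRecord("T3", "OrgB"),
    TitleRecord("T3", "OrgA"),
    TitleRecord("T4", "OrgC"),
]
ranking = rank_orgs(records)
coverage = coverage_pct(records, connected_orgs={"OrgB"})
```

With this sample, OrgA ranks first with three titles, and connecting only OrgB covers one title out of four.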

Starter queries (Wikidata)

Film distributors, ranked:

SELECT ?distributor ?distributorLabel (COUNT(DISTINCT ?film) AS ?titles) WHERE {
  ?film wdt:P31/wdt:P279* wd:Q11424 ;   # instance of film, or of any subclass of film
        wdt:P750 ?distributor .          # distributor
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} GROUP BY ?distributor ?distributorLabel
ORDER BY DESC(?titles)

Same shape, swapping type + property:

TV — wd:Q5398426 (series) + P750 (distributor)
Music — album/release + P264 (record label); prefer MusicBrainz label-release data at scale
Books — work/edition + P123 (publisher); prefer OpenLibrary edition dumps at scale
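
One way to keep the per-media queries in sync is to generate them from a single template. The sketch below is an assumption about how that might look: the names QUERY_TEMPLATE, MEDIA, and build_query are hypothetical, and the default limit is arbitrary. It includes only the film and TV pairs, since those are the only ones given explicit IDs here. It uses COUNT(DISTINCT ...) so that an item reached through several class paths is counted once.

```python
# Doubled braces are literal SPARQL braces; single braces are format fields.
QUERY_TEMPLATE = """\
SELECT ?org ?orgLabel (COUNT(DISTINCT ?item) AS ?titles) WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} ;
        wdt:{org_property} ?org .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}} GROUP BY ?org ?orgLabel
ORDER BY DESC(?titles)
LIMIT {limit}
"""

# media type -> (item type QID, property linking the item to its rights-holder)
MEDIA = {
    "film": ("Q11424", "P750"),
    "tv": ("Q5398426", "P750"),
}


def build_query(media, limit=200):
    """Return the ranked rights-holder query for one media type."""
    type_qid, org_property = MEDIA[media]
    return QUERY_TEMPLATE.format(
        type_qid=type_qid, org_property=org_property, limit=int(limit)
    )


tv_query = build_query("tv", limit=50)
```

Music and books would be added as further MEDIA entries once their item types are pinned down.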

Operational notes

Run against the Wikidata dumps, not the Wikidata Query Service (WDQS): the public endpoint enforces a 60-second query timeout, and full-corpus counts exceed it. WDQS with a LIMIT clause is fine for a first cut only.
Cadence — same schedule as the existing Wikidata pulls; make it part of the standing pipeline, not a one-off.
License — Wikidata is CC0, so harvesting this list is clean; the No-Gotchas monitor (G-067) governs each source's terms once registered.
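
As a rough sketch of the dump-based count, the code below streams entities in the Wikidata JSON dump layout (a JSON array with one entity per line) and tallies distributor claims. The function names are hypothetical and the sample IDs are made up. One assumption matters: a line-by-line pass cannot follow the subclass path that the SPARQL query uses, so the caller supplies type_qids, a precomputed set of film classes.

```python
import json
from collections import Counter


def iter_entities(lines):
    """Yield entity dicts from the one-entity-per-line JSON dump layout."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def claim_ids(entity, prop):
    """Item IDs the entity points to via prop; valueless snaks are skipped."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def count_orgs(lines, type_qids, org_prop="P750"):
    """Count titles per org for entities whose P31 is in type_qids."""
    counts = Counter()
    for entity in iter_entities(lines):
        if type_qids.isdisjoint(claim_ids(entity, "P31")):
            continue
        for org in set(claim_ids(entity, org_prop)):  # one count per title
            counts[org] += 1
    return counts


def _stmt(qid):
    """Build a minimal statement in the dump's claim layout (sample data only)."""
    return {"mainsnak": {"snaktype": "value", "datavalue": {"value": {"id": qid}}}}


# Made-up entities: two films and one non-film that must be ignored.
sample_lines = [
    "[",
    json.dumps({"id": "Q9000001", "claims": {
        "P31": [_stmt("Q11424")], "P750": [_stmt("Q9100001")]}}) + ",",
    json.dumps({"id": "Q9000002", "claims": {
        "P31": [_stmt("Q11424")],
        "P750": [_stmt("Q9100001"), _stmt("Q9100002")]}}) + ",",
    json.dumps({"id": "Q9000003", "claims": {
        "P31": [_stmt("Q5")], "P750": [_stmt("Q9100001")]}}),
    "]",
]
counts = count_orgs(sample_lines, type_qids={"Q11424"})
```

Against a real dump, the same function could take a file opened with gzip.open(path, "rt", encoding="utf-8") in place of sample_lines, which keeps memory use flat regardless of dump size.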

What it feeds

The Source/Distributor Registry and exhaustive-coverage process (G-066).
The studio press-asset connection (P-024) — tells it which rights-holders to connect next, by coverage.
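
A sketch of how the registry might absorb each harvest and hand P-024 its next targets. Everything here is illustrative: the RegistryRow field names mirror the registry attribute list, but the org_id dedupe key, the status values, and the upsert and next_to_connect helpers are assumptions, not the registry's actual schema.

```python
from dataclasses import dataclass


@dataclass
class RegistryRow:
    """Hypothetical row shape for a rights-holder entity."""
    org_id: str                  # assumed dedupe key, e.g. a Wikidata QID
    name: str
    media_type: str
    title_count: int = 0
    territories: tuple = ()
    catalog_scope: str = ""
    press_portal_url: str = ""
    access_type: str = ""
    terms: str = ""
    status: str = "unconnected"  # assumed values: "unconnected", "connected"


def upsert(registry, harvested):
    """Add new orgs; for existing rows refresh only the harvested title count."""
    for row in harvested:
        existing = registry.get(row.org_id)
        if existing is None:
            registry[row.org_id] = row
        else:
            # Curated fields (URL, terms, status) are left untouched.
            existing.title_count = row.title_count
    return registry


def next_to_connect(registry, limit=10):
    """Unconnected rights-holders, largest catalogs first."""
    pending = [r for r in registry.values() if r.status != "connected"]
    pending.sort(key=lambda r: (-r.title_count, r.org_id))
    return pending[:limit]


# Made-up data: one row already connected, then a fresh harvest of three orgs.
registry = {
    "OrgA": RegistryRow("OrgA", "Example Pictures", "film", title_count=90,
                        press_portal_url="https://press.example.com",
                        status="connected"),
}
harvested = [
    RegistryRow("OrgA", "Example Pictures", "film", title_count=120),
    RegistryRow("OrgB", "Sample Releasing", "film", title_count=40),
    RegistryRow("OrgC", "Demo Films", "film", title_count=75),
]
upsert(registry, harvested)
queue = [row.org_id for row in next_to_connect(registry)]
```

Refreshing only the count on existing rows keeps each harvest from overwriting hand-entered terms or connection status.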

Why this matters

This job turns "exhaustive source coverage" from a hand-maintained list into a self-prioritizing, self-completing process: the corpus names the rights-holders, ranked by how many titles each unlocks.