Where this runs
In the batch/data-pipeline layer (Motherbrain, which already pulls Wikidata), not the web app — per the AI/ML role split (Entry 022: offline batch), the Ingest Studio (Entry 021), and the storage architecture (Entry 023). Motherbrain is external to this repo; this is the process spec for that team.
What this job ships
A recurring job implementing the exhaustive-source method (G-066):
- Derive the distributor/label/publisher set from the CC0 corpus, with a title count per org.
- Rank by title count → coverage-weighted priority list.
- Seed the top of the list with members from industry directories (IFTA, MPA, RIAA/IFPI, AAP/IPA, ESA).
- Write to the Source/Distributor Registry (G-066) as rights-holder entities (media type, territories, catalog scope, press-portal URL, access type, terms, status), deduped against existing rows.
- Emit a coverage metric — % of titles whose distributor has a connected press source — to drive onboarding and gamified quests (G-059).
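The ranking and coverage steps above can be sketched in a few lines of Python. This is a minimal illustration, not Motherbrain's actual interface: the input shape (pairs of title ID and org ID), the function names `rank_orgs` and `coverage_pct`, and the `connected_orgs` set are all assumptions. It also assumes a title counts as covered when at least one of its orgs is connected, and that titles with no org on record are absent from the input.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (title_id, org_id); hypothetical input shape


def rank_orgs(title_org_pairs: Iterable[Pair]) -> List[Tuple[str, int]]:
    """Return (org_id, distinct title count), highest count first."""
    # Dedupe pairs so a title listed twice for the same org counts once.
    counts = Counter(org for _title, org in set(title_org_pairs))
    # Ties are broken by org_id so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def coverage_pct(title_org_pairs: Iterable[Pair], connected_orgs: Set[str]) -> float:
    """Percent of titles with at least one org in connected_orgs."""
    titles = set()
    covered = set()
    for title, org in title_org_pairs:
        titles.add(title)
        if org in connected_orgs:
            covered.add(title)
    return 100.0 * len(covered) / len(titles) if titles else 0.0


# Toy data: QA distributes three titles, QB and QC one each.
pairs = [("Q1", "QA"), ("Q2", "QA"), ("Q3", "QB"), ("Q3", "QA"), ("Q4", "QC")]
ranked = rank_orgs(pairs)
coverage_now = coverage_pct(pairs, connected_orgs={"QB"})
coverage_if_top_connected = coverage_pct(pairs, connected_orgs={"QA"})
```

In the toy data, connecting the top-ranked org lifts coverage from 25% to 75%, which is the effect the priority list is meant to surface.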
Starter queries (Wikidata)
Film distributors, ranked:
SELECT ?distributor ?distributorLabel (COUNT(?film) AS ?titles) WHERE {
  ?film wdt:P31/wdt:P279* wd:Q11424 ;   # instance of: film
        wdt:P750 ?distributor .         # distributor
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} GROUP BY ?distributor ?distributorLabel
ORDER BY DESC(?titles)
Same shape, swapping type + property:
- TV: wd:Q5398426 (series) + P750 (distributor)
- Music: album/release + P264 (record label); prefer MusicBrainz label-release data at scale
- Books: work/edition + P123 (publisher); prefer OpenLibrary edition dumps at scale
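One way to keep the per-media queries in sync is to render them from a single template. The sketch below is an assumption about how that could be organized: `QUERY_TEMPLATE`, `MEDIA_QUERIES`, and `build_query` are hypothetical names, and the table holds only the class/property pairs this spec names explicitly. The music and book class QIDs are left for the pipeline team to choose.

```python
from string import Template
from typing import Optional

# Same shape as the film query, with the class and property swapped in.
QUERY_TEMPLATE = Template("""\
SELECT ?org ?orgLabel (COUNT(?item) AS ?titles) WHERE {
  ?item wdt:P31/wdt:P279* wd:$type_qid ;
        wdt:$prop ?org .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} GROUP BY ?org ?orgLabel
ORDER BY DESC(?titles)
""")

# Only the pairs given in this spec; add music (P264) and books (P123)
# once their class QIDs are decided.
MEDIA_QUERIES = {
    "film": ("Q11424", "P750"),
    "tv": ("Q5398426", "P750"),
}


def build_query(media_type: str, limit: Optional[int] = None) -> str:
    """Render the ranking query for one media type, optionally with LIMIT."""
    type_qid, prop = MEDIA_QUERIES[media_type]
    query = QUERY_TEMPLATE.substitute(type_qid=type_qid, prop=prop)
    if limit is not None:
        query += f"LIMIT {int(limit)}\n"
    return query


tv_first_cut = build_query("tv", limit=100)
film_full = build_query("film")
```

The `limit` argument is there for the first-cut WDQS runs; full-corpus counts would leave it unset and run against a dump-backed store instead.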
Operational notes
- Run against the Wikidata dumps, not WDQS: the live query service times out on full-corpus counts. WDQS with LIMIT is fine for a first cut only.
- Cadence: same schedule as the existing Wikidata pulls; make it part of the standing pipeline, not a one-off.
- License: Wikidata is CC0, so harvesting this list raises no licensing issue; the No-Gotchas monitor (G-067) governs each source's terms once registered.
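For the dump-based run, a single streaming pass over the Wikidata JSON dump is enough to get the per-org counts. The sketch below assumes the standard JSON dump layout (one entity per line inside a JSON array, with trailing commas); the function names are hypothetical. It also assumes `type_qids` has already been expanded to include subclasses, because a line-by-line pass cannot follow P279* the way the SPARQL query does.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Set


def iter_entities(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield entities from dump lines, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def item_ids(entity: Dict, prop: str) -> Set[str]:
    """Collect the item QIDs that the entity's statements for prop point to."""
    ids = set()
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":  # skip novalue / somevalue
            continue
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            ids.add(value["id"])
    return ids


def count_orgs(lines: Iterable[str], type_qids: Set[str], org_prop: str = "P750") -> Counter:
    """Count titles per org for entities whose P31 is in type_qids."""
    counts: Counter = Counter()
    for entity in iter_entities(lines):
        if item_ids(entity, "P31") & type_qids:
            counts.update(item_ids(entity, org_prop))
    return counts
```

In practice the lines would come from something like `bz2.open("latest-all.json.bz2", "rt")`, so the dump is never fully loaded into memory; `counts.most_common()` then gives the ranked list.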
What it feeds
- The Source/Distributor Registry and exhaustive-coverage process (G-066).
- The studio press-asset connection (P-024) — tells it which rights-holders to connect next, by coverage.
Why this matters
It turns "exhaustive source coverage" from a hand-maintained list into a self-prioritizing, self-completing process — the corpus names the rights-holders, ranked by how much each unlocks.