P-025 (Open)

Distributor / source harvest (standing job)

A recurring batch job in the data pipeline (Motherbrain) that derives the exhaustive distributor/label/publisher list from the CC0 corpus, ranks it by coverage, seeds it from industry directories, and writes it to the Source/Distributor Registry. It is the corpus-driven engine behind exhaustive source coverage (G-066) and the studio press connection (P-024).

Milestone: Near-term
Posted credit value: 25 credits
Founder credits accrued: 0 of 25
Owner: Creator
Related: Entry 019, Entry 021, Entry 022, Entry 023, G-066, G-067, P-024

Where this runs

In the batch/data-pipeline layer (Motherbrain, which already pulls Wikidata), not the web app — per the AI/ML role split (Entry 022: offline batch), the Ingest Studio (Entry 021), and the storage architecture (Entry 023). Motherbrain is external to this repo; this is the process spec for that team.

What this job ships

A recurring job implementing the exhaustive-source method (G-066):

  1. Derive the distributor/label/publisher set from the CC0 corpus, with a title count per org.
  2. Rank by title count → coverage-weighted priority list.
  3. Seed the top with industry directories (IFTA, MPA, RIAA/IFPI, AAP/IPA, ESA).
  4. Write to the Source/Distributor Registry (G-066) as rights-holder entities (media type, territories, catalog scope, press-portal URL, access type, terms, status), deduped against existing rows.
  5. Emit a coverage metric — % of titles whose distributor has a connected press source — to drive onboarding and gamified quests (G-059).
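
Steps 2 through 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Motherbrain implementation: the `Candidate` type, the name-based dedupe key in `norm`, and the reading of "seed the top" as "directory members sort ahead of corpus-only orgs" are all hypothetical. A real registry would more likely dedupe on a stable ID (for example a Wikidata QID) and write the full rights-holder fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    titles: int   # title count from the corpus (0 if directory-only)
    seeded: bool  # True if the org appears in an industry directory


def norm(name: str) -> str:
    """Hypothetical dedupe key: case-folded, whitespace-collapsed name."""
    return " ".join(name.casefold().split())


def priority_list(title_counts, directory_names, registry_names):
    """Steps 2-4: rank by title count, seed from directories, dedupe against the registry."""
    existing = {norm(n) for n in registry_names}
    seeds = {norm(n): n for n in directory_names}
    merged = {}
    for name, titles in title_counts.items():
        key = norm(name)
        prev = merged.get(key)
        total = titles + (prev.titles if prev else 0)  # fold name variants together
        merged[key] = Candidate(prev.name if prev else name, total, key in seeds)
    for key, name in seeds.items():
        merged.setdefault(key, Candidate(name, 0, True))  # directory-only orgs
    fresh = [c for k, c in merged.items() if k not in existing]
    # Assumption: seeded orgs first, then descending title count, then name.
    return sorted(fresh, key=lambda c: (not c.seeded, -c.titles, c.name))


def coverage_pct(title_to_orgs, connected):
    """Step 5: % of titles with at least one org that has a connected press source."""
    if not title_to_orgs:
        return 0.0
    connected_keys = {norm(n) for n in connected}
    covered = sum(
        1 for orgs in title_to_orgs.values()
        if any(norm(o) in connected_keys for o in orgs)
    )
    return 100.0 * covered / len(title_to_orgs)
```

Counting coverage per title rather than summing per-org counts avoids double-counting titles that list more than one distributor.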

Starter queries (Wikidata)

Film distributors, ranked:

SELECT ?distributor ?distributorLabel (COUNT(DISTINCT ?film) AS ?titles) WHERE {
  ?film wdt:P31/wdt:P279* wd:Q11424 ;   # instance of: film (or a subclass of film)
        wdt:P750 ?distributor .          # distributed by
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} GROUP BY ?distributor ?distributorLabel
ORDER BY DESC(?titles)

Same shape, swapping type + property:

  • TV — wd:Q5398426 (series) + P750 (distributor)
  • Music — album/release + P264 (record label); prefer MusicBrainz label-release data at scale
  • Books — work/edition + P123 (publisher); prefer OpenLibrary edition dumps at scale
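
Because the queries share one shape, the swap of type and property can be parameterized. The sketch below builds the query text only and does not send it anywhere; `ranked_query` and `QUERY_TEMPLATE` are hypothetical names, and `COUNT(DISTINCT ...)` is a choice made here to avoid counting an item twice when it matches the class path more than once. Only the film and TV identifiers given above are used; the music and book item types are left for the pipeline team to fill in.

```python
import re

# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """\
SELECT ?org ?orgLabel (COUNT(DISTINCT ?item) AS ?titles) WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} ;
        wdt:{prop_pid} ?org .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}} GROUP BY ?org ?orgLabel
ORDER BY DESC(?titles){limit_clause}
"""


def ranked_query(type_qid: str, prop_pid: str, limit=None) -> str:
    """Return the ranked-org SPARQL query for one item type and one property."""
    if not re.fullmatch(r"Q[1-9]\d*", type_qid):
        raise ValueError(f"not a Wikidata item ID: {type_qid!r}")
    if not re.fullmatch(r"P[1-9]\d*", prop_pid):
        raise ValueError(f"not a Wikidata property ID: {prop_pid!r}")
    limit_clause = f"\nLIMIT {int(limit)}" if limit is not None else ""
    return QUERY_TEMPLATE.format(
        type_qid=type_qid, prop_pid=prop_pid, limit_clause=limit_clause
    )


film_query = ranked_query("Q11424", "P750")           # film + distributor
tv_query = ranked_query("Q5398426", "P750", limit=100)  # TV series, first cut
```

Validating the IDs before formatting keeps stray input out of the query text.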

Operational notes

  • Run against the Wikidata dumps, not WDQS — the live query service times out on full-corpus counts. WDQS with LIMIT is fine for a first cut only.
  • Cadence — same schedule as the existing Wikidata pulls; make it part of the standing pipeline, not a one-off.
  • License — Wikidata is CC0, so harvesting this list is clean; the No-Gotchas monitor (G-067) governs each source's terms once registered.
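
For the dump-based run, a streaming count might look like the sketch below. It assumes the standard Wikidata JSON dump layout (one entity per line inside a JSON array, with trailing commas) and takes any iterable of lines, so the same function works on an open file or a test fixture. The names `entity_ids` and `count_orgs` are hypothetical. Two simplifications are worth flagging: the dump has no `P279*` path, so `type_qids` must already contain the target class plus its subclasses (computed separately), and statements are filtered only by skipping deprecated rank, which approximates but does not exactly match what `wdt:` returns in WDQS.

```python
import json
from collections import Counter


def entity_ids(entity, pid):
    """Yield item IDs from an entity's non-deprecated statements for property `pid`."""
    for claim in entity.get("claims", {}).get(pid, []):
        if claim.get("rank") == "deprecated":
            continue
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":  # skip novalue / somevalue
            continue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            yield value["id"]


def count_orgs(lines, type_qids, org_pid):
    """Count titles per org for entities whose P31 is in `type_qids`."""
    counts = Counter()
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):  # array brackets and blank lines
            continue
        entity = json.loads(line)
        if type_qids.isdisjoint(entity_ids(entity, "P31")):
            continue
        for org in set(entity_ids(entity, org_pid)):  # one count per title per org
            counts[org] += 1
    return counts
```

In use, the lines would come from the compressed dump, for example `gzip.open(dump_path, "rt", encoding="utf-8")` with `dump_path` pointing at wherever the pipeline stores it, and `counts.most_common()` gives the ranked list.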

What it feeds

  • The Source/Distributor Registry and exhaustive-coverage process (G-066).
  • The studio press-asset connection (P-024) — tells it which rights-holders to connect next, by coverage.
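
One way to read "which rights-holders to connect next, by coverage" is as a marginal-gain ranking: for each unconnected org, count the titles it would newly cover. The sketch below is an illustration under that assumption; `next_to_connect` and its inputs are hypothetical and not part of the P-024 spec.

```python
from collections import Counter


def next_to_connect(title_to_orgs, connected, top_n=10):
    """Rank unconnected orgs by how many currently uncovered titles each would cover."""
    connected = set(connected)
    gains = Counter()
    for orgs in title_to_orgs.values():
        orgs = set(orgs)
        if connected & orgs:  # title already covered by a connected org
            continue
        for org in orgs:
            gains[org] += 1
    # Highest gain first; ties broken by org ID for a stable order.
    return sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

Ranking by uncovered titles rather than raw title count keeps the list from recommending an org whose catalog is already reachable through a connected co-distributor.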

Why this matters

It turns "exhaustive source coverage" from a hand-maintained list into a self-prioritizing, self-completing process — the corpus names the rights-holders, ranked by how much each unlocks.