G-068: Real-world metadata sourcing & aggregator independence (HellaThis)

Approach for building an independently-provenanced real-world metadata corpus (movies/TV/books/music) so HellaThis can run off OLN instead of aggregators. The key correction: the risk is contract (ToS) and database rights, not fact copyright — so the corpus becomes "ours" via independent primary/open-source provenance, not via QA-laundering aggregator data.

Why this matters

HellaThis wants off the aggregator treadmill (TMDB et al.) and onto OLN's graph (Entry 008, one-way OLN→HellaThis feed, G-029). That graph is OLN's network-level entity layer for real-world works (Entry 002, G-048). How chroniclers source it decides whether it is legally durable.

Legal framing (high level — not legal advice; counsel required)

Facts are not copyrightable (US: Feist v. Rural, 1991). A release date, runtime, or cast list is free to use regardless of where it was seen, so re-keying individual facts, or running QA over them, resolves the copyright/expression question.
But the real risks are elsewhere:
- Contract (ToS). Using an aggregator's API binds you to its terms (attribution, no-competing-product, no-permanent-storage, delete-on-termination) regardless of copyright. Verification does not erase a contract — you cannot launder data out of terms you agreed to.
- Database rights (EU/UK). Copying a substantial part of a compilation is a separate risk even when each fact is free.
- Images & prose (posters, synopses) are copyrighted expression — never ingest; facts only.
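The "facts only" rule above can be sketched as an allowlist applied before anything is stored. This is a minimal illustration, not the actual ingest code: the field names (`title`, `release_date`, `runtime_minutes`, `cast`, `synopsis`, `poster_url`) are hypothetical, since this note does not define a schema.

```python
# Hypothetical field names; the real schema is not defined in this note.
FACT_FIELDS = {"title", "release_date", "runtime_minutes", "cast"}


def facts_only(record: dict) -> dict:
    """Keep allowlisted factual fields; drop everything else (posters, synopses, ...)."""
    return {key: value for key, value in record.items() if key in FACT_FIELDS}


raw = {
    "title": "Example Film",
    "release_date": "1999-01-01",
    "runtime_minutes": 101,
    "synopsis": "Expressive prose that must not be ingested.",
    "poster_url": "https://example.org/poster.jpg",
}
clean = facts_only(raw)  # only title, release_date, runtime_minutes survive
```

An allowlist is used rather than a denylist so that any unrecognized field is dropped by default instead of slipping through.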

The mechanism that works: independent provenance

The corpus becomes "ours" not by transforming aggregator data but by sourcing each Fact independently from the start:

- Seed from CC0/open + primary sources. Wikidata (CC0), MusicBrainz, OpenLibrary, plus official/primary sources via G-066. This gives a legally clean skeleton.
- Aggregators as at most a human cross-check. They are surfaced as a reference-only signal (the G-044 pattern), never stored as the canonical value; ideally they are absent altogether, given the independence goal.
- The promotion rule. A Fact graduates to canonical only with non-aggregator provenance (Entry 020 tier + G-026). "Saw it on TMDB" stays Provisional. Chronicler verification means attaching the independent citation; that is the value chroniclers add.
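As a rough sketch of the promotion rule, the gate could look like the following. All type and function names here (`SourceKind`, `Citation`, `Fact`, `promotion_status`) are assumptions made for illustration, and the sketch ignores the citation tiers of Entry 020 and the scoring in G-026, whose details are not given in this note.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceKind(Enum):
    OPEN = "open"              # e.g. Wikidata, MusicBrainz, OpenLibrary
    PRIMARY = "primary"        # official/primary sources
    AGGREGATOR = "aggregator"  # e.g. TMDB: reference-only, never sufficient


class Status(Enum):
    PROVISIONAL = "provisional"
    CANONICAL = "canonical"


@dataclass(frozen=True)
class Citation:
    source: str
    kind: SourceKind


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    citations: list[Citation] = field(default_factory=list)


def promotion_status(fact: Fact) -> Status:
    """Canonical only if at least one citation is not an aggregator."""
    independent = [c for c in fact.citations if c.kind is not SourceKind.AGGREGATOR]
    return Status.CANONICAL if independent else Status.PROVISIONAL


tmdb = Citation("TMDB", SourceKind.AGGREGATOR)
wikidata = Citation("Wikidata", SourceKind.OPEN)

seen_on_tmdb = Fact("film:example", "runtime_minutes", "101", [tmdb])
verified = Fact("film:example", "runtime_minutes", "101", [tmdb, wikidata])
```

Under these assumptions, `seen_on_tmdb` stays Provisional and `verified` is promoted only because of its Wikidata citation; the aggregator citation contributes nothing to the decision.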

Aggregator data as an ML validation oracle?

A tempting middle path: don't store aggregator values, just use them at QA time as an oracle for ML jobs to validate/score our independently-extracted Facts. Assessment:

- Point cross-checks and training are different risks. A transient comparison (extract independently → compare to the aggregator value → keep only a match/mismatch flag → discard the aggregator value) carries low copyright and database-rights risk. Training an ML model on aggregator outputs is much higher risk and inconsistent with reciprocity: it is the exact move we object to in Entry 030 (others training on our data).
- The gate is still the contract. API access binds you to ToS that commonly forbid using the data to build a competing or derivative dataset, which an ML validation pipeline feeding our corpus arguably is. "We don't store it" cures the copyright problem, not the contract problem.
- An oracle is a hidden dependency. If our confidence and quality scores are tuned against TMDB, we silently depend on TMDB even without storing a single value, which defeats the independence goal. An independent corpus needs an independent oracle.
- Clean resolution: ensemble the permitted sources. Use an ensemble of license-permitted sources (Wikidata CC0, MusicBrainz, OpenLibrary, primary/official via G-066) as the validation oracle. Agreement across independent permitted sources is a stronger signal than any single aggregator, feeds confidence scoring (G-026) and reliability (G-059), and is fully clean; CC0 data can even be used for training. The aggregator becomes unnecessary, not merely risky.
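One way the ensemble oracle might compute an agreement signal is sketched below. The source labels, the `min_sources` threshold, and the tie handling are all assumptions for illustration; how the result feeds G-026 and G-059 is not specified here. The sketch also takes the independence of the permitted sources as given.

```python
from collections import Counter

# Only license-permitted sources may vote. Labels are illustrative.
PERMITTED_SOURCES = {"wikidata", "musicbrainz", "openlibrary", "primary"}


def ensemble_vote(observations: dict[str, str], min_sources: int = 2):
    """Return (consensus_value, agreement), or (None, 0.0) when there is no consensus.

    agreement is the share of permitted sources reporting the most common value.
    Non-permitted sources (e.g. an aggregator) are ignored entirely.
    """
    permitted = {s: v for s, v in observations.items() if s in PERMITTED_SOURCES}
    if len(permitted) < min_sources:
        return None, 0.0
    ranked = Counter(permitted.values()).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, 0.0  # tie between the top two values: no consensus
    value, count = ranked[0]
    return value, count / len(permitted)


observations = {
    "wikidata": "101",
    "primary": "101",
    "musicbrainz": "99",
    "tmdb": "101",  # not permitted: has no influence on the result
}
value, agreement = ensemble_vote(observations)
```

Here two of the three permitted sources agree on `"101"`, so the agreement is 2/3; the `tmdb` entry is filtered out before any counting, which keeps the oracle free of the hidden dependency described above.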

Open decisions

- Which open datasets are in scope, and how their licenses are handled (G-039 provider licensing review; G-044/G-002).
- Whether any aggregator API is used at all, and if so under what stored-vs-reference boundary that honors its ToS (G-067 compliance monitor).
- Validation-oracle policy: confirm the ensemble-of-permitted-sources oracle and an explicit no-training-on-restricted-sources rule (counsel to review TMDB ML/derivative/competing-use clauses).
- Whether the non-aggregator-provenance promotion rule is a hard gate or a soft signal.
- Sequencing: build the network real-world entity layer now (with the rule baked in) or defer until OLN core use-cases are covered.

Related

Entry 008: HellaThis as sister product (the consumer) · G-029: its licensing
Entry 002: canonical dual-surface · G-048: entity-type vocabulary
Entry 019: provider-seeded cold-start · Entry 020: citation tiers
G-026: Fact provenance/quality · G-039: provider licensing review
G-044: CC-BY-SA / reference-only pattern · G-002: IP posture
G-066: source nomination & pattern ingest · G-067: compliance monitoring