Approach for building an independently-provenanced real-world metadata corpus (movies/TV/books/music) so HellaThis can run off OLN instead of aggregators. The key correction: the risk is contract (ToS) and database rights, not fact copyright — so the corpus becomes "ours" via independent primary/open-source provenance, not via QA-laundering aggregator data.
Why this matters
HellaThis wants off the aggregator treadmill (TMDB et al.) and onto OLN's graph (Entry 008, one-way OLN→HellaThis feed, G-029). That graph is OLN's network-level entity layer for real-world works (Entry 002, G-048). How chroniclers source it decides whether it is legally durable.
Legal framing (high level — not legal advice; counsel required)
- Facts are not copyrightable (US: Feist v. Rural, 1991). A release date, runtime, or cast list is free regardless of where it was seen, so re-keying individual facts, or running QA over them, cleanly solves the copyright/expression problem.
- But the real risks are elsewhere:
  - Contract (ToS). Using an aggregator's API binds you to its terms (attribution, no-competing-product, no-permanent-storage, delete-on-termination) regardless of copyright. Verification does not erase a contract: you cannot launder data out of terms you agreed to.
  - Database rights (EU/UK). Copying a substantial part of a compilation is a separate risk even when each fact is free.
- Images & prose (posters, synopses) are copyrighted expression — never ingest; facts only.
The mechanism that works: independent provenance
The corpus becomes "ours" not by transforming aggregator data but by sourcing each Fact independently from the start:
- Seed from CC0/open + primary sources — Wikidata (CC0), MusicBrainz, OpenLibrary, plus official/primary sources via G-066. Legally clean skeleton.
- Aggregators are at most a human cross-check, surfaced as a reference-only signal (the G-044 pattern) and never stored as the canonical value; ideally they are absent altogether, given the independence goal.
- The promotion rule — a Fact graduates to canonical only with non-aggregator provenance (Entry 020 tier + G-026). "Saw it on TMDB" stays Provisional. Chronicler verification = attaching the independent citation; that is the value.
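The promotion rule above can be sketched as a small gate. This is a minimal illustration, not the OLN schema: `Fact`, `Citation`, `SourceKind`, `Status`, and `promotion_status` are hypothetical names, and the real Entry 020 tiers and G-026 scoring would replace the simple any-independent-citation check assumed here.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceKind(Enum):
    # Hypothetical classification; the real tiers live in Entry 020.
    OPEN = "open"              # e.g. Wikidata, MusicBrainz, OpenLibrary
    PRIMARY = "primary"        # official / primary source
    AGGREGATOR = "aggregator"  # e.g. TMDB; reference-only at most


class Status(Enum):
    PROVISIONAL = "provisional"
    CANONICAL = "canonical"


@dataclass(frozen=True)
class Citation:
    source: str
    kind: SourceKind


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    citations: list[Citation] = field(default_factory=list)


def promotion_status(fact: Fact) -> Status:
    """Return CANONICAL only if at least one citation is non-aggregator.

    Aggregator citations never count toward promotion, so a Fact seen
    only on an aggregator stays PROVISIONAL.
    """
    independent = [
        c for c in fact.citations if c.kind is not SourceKind.AGGREGATOR
    ]
    return Status.CANONICAL if independent else Status.PROVISIONAL


# Usage: a Fact backed only by an aggregator stays Provisional until a
# chronicler attaches an independent citation.
fact = Fact("film:example", "release_date", "1999-03-31",
            [Citation("TMDB", SourceKind.AGGREGATOR)])
before = promotion_status(fact)
fact.citations.append(Citation("Wikidata", SourceKind.OPEN))
after = promotion_status(fact)
```

Treating the rule as a pure function of the citation list keeps the hard-gate vs. soft-signal decision open: the same check could block promotion outright or only feed a score.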
Aggregator data as an ML validation oracle?
A tempting middle path: don't store aggregator values; use them only at QA time, as an oracle against which ML jobs validate or score our independently extracted Facts. Assessment:
- A point cross-check and training carry different risks. Transient comparison (extract independently → compare to the aggregator value → keep only a match/mismatch flag → discard theirs) is low copyright/DB-rights risk. Training an ML model on their outputs is much higher risk and inconsistent with reciprocity: it is the exact move we object to in Entry 030 (others training on our data).
- The gate is still the contract. API access binds you to ToS that commonly forbid using the data to build a competing/derivative dataset — which an ML validation pipeline feeding our corpus arguably is. "We don't store it" cures copyright, not the contract.
- An oracle is a hidden dependency. If our confidence and quality are tuned against TMDB, we are silently dependent on TMDB even without storing a value — defeating the independence goal. An independent corpus needs an independent oracle.
- Clean resolution: ensemble the permitted sources. Use an ensemble of license-permitted sources (Wikidata CC0, MusicBrainz, OpenLibrary, primary/official via G-066) as the validation oracle. Agreement across independent permitted sources is a stronger signal than any single aggregator, feeds confidence scoring (G-026) and reliability (G-059), and is fully clean; you can even train on CC0. The aggregator becomes unnecessary, not merely risky.
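As a rough illustration of the ensemble oracle, the sketch below scores a candidate value by agreement across permitted sources only. The source keys, the `PERMITTED_SOURCES` allowlist, the `normalize` helper, and the simple agreement ratio are all assumptions made for the example; the real confidence model belongs to G-026/G-059.

```python
# Hypothetical allowlist; the actual in-scope sources are an open decision.
PERMITTED_SOURCES = frozenset(
    {"wikidata", "musicbrainz", "openlibrary", "primary"}
)


def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial formatting differences match."""
    return " ".join(value.split()).casefold()


def ensemble_agreement(candidate: str, observations: dict[str, str]) -> dict:
    """Score a candidate value against permitted sources only.

    Observations from sources outside the allowlist (e.g. an aggregator)
    are dropped before scoring, so they cannot influence the result.
    """
    permitted = {
        source: normalize(value)
        for source, value in observations.items()
        if source in PERMITTED_SOURCES
    }
    if not permitted:
        # No permitted source has an opinion: no score rather than a guess.
        return {"agreeing": [], "dissenting": [], "score": None}

    target = normalize(candidate)
    agreeing = sorted(s for s, v in permitted.items() if v == target)
    dissenting = sorted(s for s, v in permitted.items() if v != target)
    return {
        "agreeing": agreeing,
        "dissenting": dissenting,
        "score": len(agreeing) / len(permitted),
    }


# Usage: the "tmdb" observation is ignored; two of three permitted
# sources agree with the candidate.
result = ensemble_agreement(
    "1999-03-31",
    {
        "wikidata": "1999-03-31",
        "primary": " 1999-03-31 ",
        "openlibrary": "1999-03-30",
        "tmdb": "1999-04-01",
    },
)
```

Filtering to the allowlist before scoring is the point of the design: an aggregator value never enters the computation, so confidence is not silently tuned against it.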
Open decisions
- Which open datasets are in scope and their license handling (G-039 provider licensing review; G-044/G-002).
- Whether any aggregator API is used at all, and if so under what stored-vs-reference boundary that honors its ToS (G-067 compliance monitor).
- Validation-oracle policy — confirm the ensemble-of-permitted-sources oracle and an explicit no-training-on-restricted-sources rule (counsel to review TMDB ML/derivative/competing-use clauses).
- The non-aggregator-provenance promotion rule as a hard gate vs. soft signal.
- Sequencing: build the network real-world entity layer now (with the rule baked in) vs. defer until OLN core use-cases are covered.
Related
- Entry 008 — HellaThis as sister product (the consumer) · G-029 — its licensing
- Entry 002 — canonical dual-surface · G-048 — entity-type vocabulary
- Entry 019 — provider-seeded cold-start · Entry 020 — citation tiers
- G-026 — Fact provenance/quality · G-039 — provider licensing review
- G-044 — CC-BY-SA / reference-only pattern · G-002 — IP posture
- G-066 — source nomination & pattern ingest · G-067 — compliance monitoring