Decision
A Fact is only as trustworthy as the entity it describes. So the pipeline gates extraction on type confidence: it will not auto-extract facts from an entity until it is confident what kind of thing that entity is. Confidence is graded by how many independent signals agree:
- Corroborated — an authoritative type (Wikidata) and the source's own infobox agree. Highest trust; auto-extracts.
- Single-signal — only one weak indicator, with nothing to corroborate it. Held for human validation.
- Conflicting — the infobox disagrees with the chosen type. Held; a conflict is a red flag, not a coin toss.
- No usable signal — held.
The corroborated grade (the infobox agreeing with the authoritative type) was promoted to "trusted" once we were satisfied that agreement between independent sources clears the bar; everything below it waits for a human. The default is fail-closed: when in doubt, don't publish.
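As a rough illustration of the grading above, here is a minimal Python sketch. All names in it (`TypeConfidence`, `grade_type`, `may_auto_extract`, `TRUSTED_GRADES`) are hypothetical and not taken from the pipeline. It also assumes two things this entry does not state: that the chosen type is the authoritative (Wikidata) type, and that a lone signal of either kind counts as single-signal.

```python
from enum import Enum
from typing import Optional


class TypeConfidence(Enum):
    CORROBORATED = "corroborated"    # authoritative type and infobox agree
    SINGLE_SIGNAL = "single_signal"  # one indicator, nothing to corroborate it
    CONFLICTING = "conflicting"      # infobox disagrees with the chosen type
    NO_SIGNAL = "no_signal"          # nothing usable


# Fail-closed: only grades listed here auto-extract; everything else is held.
TRUSTED_GRADES = frozenset({TypeConfidence.CORROBORATED})


def grade_type(chosen_type: Optional[str], infobox_type: Optional[str]) -> TypeConfidence:
    """Grade type confidence from two signals (assumed inputs, see lead-in)."""
    if chosen_type and infobox_type:
        if chosen_type == infobox_type:
            return TypeConfidence.CORROBORATED
        return TypeConfidence.CONFLICTING
    if chosen_type or infobox_type:
        # Assumption: any lone signal is treated as single-signal and held.
        return TypeConfidence.SINGLE_SIGNAL
    return TypeConfidence.NO_SIGNAL


def may_auto_extract(grade: TypeConfidence) -> bool:
    """True only for trusted grades; any other grade waits for a human."""
    return grade in TRUSTED_GRADES


if __name__ == "__main__":
    for chosen, infobox in [("film", "film"), ("film", "person"), ("film", None), (None, None)]:
        grade = grade_type(chosen, infobox)
        print(chosen, infobox, grade.value, may_auto_extract(grade))
```

Keeping the trusted set as an explicit allow-list means a newly added grade is held by default until someone deliberately promotes it, which matches the fail-closed default.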
Why
The graph is only valuable if facts attach to the right kind of thing. A fact attached to the wrong kind of thing is worse than no fact at all: it is confidently wrong, and it propagates. A character's "homeworld" makes no sense on a film; a film's "box office" makes no sense on a person. Getting the type right first is the cheapest place to stop an entire class of garbage, and corroboration across independent sources is a far better confidence signal than any single source's say-so. This is the same fail-closed instinct as source compliance (Entry 033) and citation tiers (Entry 020), applied one layer earlier, to the entity itself.
It also keeps humans where they add value: not rubber-stamping the obvious, but adjudicating the genuinely ambiguous middle band.
Open threads
- The single schema source: the per-type vocabulary that display uses and the vocabulary the extractor asks for should be one curated source, not two that can drift. Approach and timing are still to settle.
- Spot-validating samples of the ambiguous band to turn the confidence thresholds from a careful guess into measured accuracy.
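One possible shape for the single schema source in the first thread, offered only as a sketch and not as a settled approach: a single curated mapping from entity type to its fact vocabulary, which both display and extraction read. The names `ENTITY_SCHEMA`, `fields_for`, and `is_valid_fact` are hypothetical, and the two entries reuse the examples from the Why section.

```python
from typing import Dict, Tuple

# Hypothetical curated schema: the one place where per-type vocabulary lives.
ENTITY_SCHEMA: Dict[str, Tuple[str, ...]] = {
    "character": ("homeworld",),
    "film": ("box office",),
}


def fields_for(entity_type: str) -> Tuple[str, ...]:
    """Vocabulary for a type; unknown types get no fields (fail-closed)."""
    return ENTITY_SCHEMA.get(entity_type, ())


def is_valid_fact(entity_type: str, field: str) -> bool:
    """A fact is acceptable only if the schema lists the field for that type."""
    return field in fields_for(entity_type)


# Display and extraction both call fields_for, so the two cannot drift apart.
display_fields = fields_for("film")
extraction_fields = fields_for("film")
```

With this shape, `is_valid_fact("film", "homeworld")` is false, so a mistyped entity cannot pick up another type's vocabulary.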
Related
- Entry 020 — citation strength tiers
- Entry 022 — the AI / ML role split (propose, dispose, enforce)
- Entry 033 — fail-closed source compliance
- Entry 035 — entity facts as registry data (the display side of the same schema)
- G-026 — Fact graph data quality
- G-060 — the review queue where held facts are adjudicated