Wikidata: The Structured Database Fueling Artificial Intelligence

The Anti-Hallucination Fuel for AI
The Guardians of Structured Knowledge

Everyone knows Wikipedia, the universal encyclopedia where knowledge is presented in long texts. But for an artificial intelligence, finding a fact in raw prose is like searching for a needle in a haystack. The real treasure for machines is Wikipedia’s lesser-known cousin: Wikidata. It’s the structured database that quietly allows AI to be smarter and, above all, more reliable.

Before, a machine had to dissect entire pages to grasp a piece of information. Today, with Wikidata, it consults an ultra-precise identity sheet, with clear links between each piece of data. It’s the blueprint of global knowledge, not for humans, but for algorithms.

Wikidata stands apart from other Wikimedia Foundation projects, such as Wikipedia or Wiktionary, in how it stores knowledge. Instead of text, this collaborative, multilingual platform organizes interconnected entities. Think of it as a gigantic LEGO set where each brick (a person, a place, a concept) is connected to others by attributes that machines can read and interpret instantly. Essentially, it’s a monumental knowledge graph.

Every entry on Wikidata has a unique identifier: a “Q” number for items (e.g., Q42 for Douglas Adams) and a “P” number for properties (P50 for author, P19 for place of birth). Combining these identifiers yields RDF triples (subject, predicate, object) that describe precise, verifiable facts. For example: The Hitchhiker’s Guide to the Galaxy (subject) has the author (predicate, P50) Douglas Adams (object). On Wikidata, the author property is recorded on the work, so the book is the subject and the person is the object.
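To make the subject, predicate, object idea concrete, here is a minimal Python sketch of a tiny triple store. It is an illustration, not how Wikidata stores data internally. Q42, P50 and P19 come from the text; `Q_BOOK` and `Q_CITY` are hypothetical placeholders rather than real Wikidata identifiers, and the `Triple`, `objects_of` and `describe` names are invented for this example. Note that the author property (P50) is attached to the work, so the book is the subject of that triple.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # identifier of the entity being described
    predicate: str  # P-identifier of the property
    obj: str        # identifier (or literal value) the property points to


# Q42, P50 and P19 are taken from the article. "Q_BOOK" and "Q_CITY" are
# hypothetical placeholders, NOT real Wikidata identifiers.
LABELS = {
    "Q42": "Douglas Adams",
    "P50": "author",
    "P19": "place of birth",
    "Q_BOOK": "The Hitchhiker's Guide to the Galaxy",
    "Q_CITY": "birthplace of Douglas Adams",
}

TRIPLES = [
    Triple("Q_BOOK", "P50", "Q42"),  # the book has author Douglas Adams
    Triple("Q42", "P19", "Q_CITY"),  # Douglas Adams has a place of birth
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in TRIPLES if t.subject == subject and t.predicate == predicate]


def describe(triple: Triple) -> str:
    """Render a triple with human-readable labels instead of identifiers."""
    return " | ".join(LABELS.get(part, part) for part in triple)


if __name__ == "__main__":
    for triple in TRIPLES:
        print(describe(triple))
```

Because each fact is a self-contained record, answering "who wrote this book?" is a simple lookup (`objects_of("Q_BOOK", "P50")`) rather than a text-parsing problem.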

Wikipedia (text)

Knowledge expressed in prose, requiring complex semantic analysis by AI.
Unstructured data, difficult to query directly for precise facts.
Higher risk of ambiguity or multiple interpretations for an algorithm.

Wikidata (structured data)

Knowledge in the form of RDF triples, directly readable and exploitable by machines.
Structured data that can be queried precisely with SPARQL.
Reduced ambiguity, with verified facts and clear links for AI.

The Anti-Hallucination Fuel for AI

The sheer volume of data managed by Wikidata is dizzying. By mid-2024, the database had already passed 1.5 billion semantic triples. This information isn’t just browsable; it can be queried through a public endpoint, the Wikidata Query Service, using the SPARQL query language. Concretely, you can ask for “all French writers born in Nantes” and get a usable list back, without having to read dozens of pages of results.
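As a rough sketch of what the “French writers born in Nantes” request could look like, the Python below builds a SPARQL query and the URL for the public Wikidata Query Service endpoint. Several things here are assumptions to verify before use: “French” is interpreted as country of citizenship (P27), “writer” as an occupation (P106), and the item identifiers Q36180 (writer), Q142 (France) and Q12191 (Nantes) should be checked on wikidata.org. The function names and the User-Agent string are made up for this example.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(occupation: str, citizenship: str, birthplace: str, limit: int = 50) -> str:
    """Build a SPARQL query for people matching three item identifiers.

    P106 = occupation, P27 = country of citizenship, P19 = place of birth.
    The wd:, wdt:, wikibase: and bd: prefixes are predefined by the endpoint.
    """
    return f"""
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P106 wd:{occupation} ;
          wdt:P27 wd:{citizenship} ;
          wdt:P19 wd:{birthplace} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def build_url(query: str) -> str:
    """Return the GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def run_query(query: str) -> dict:
    """Send the query over the network and decode the JSON response."""
    request = urllib.request.Request(
        build_url(query),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "wikidata-article-example/0.1 (example@example.org)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumed identifiers: Q36180 = writer, Q142 = France, Q12191 = Nantes.
    print(build_query("Q36180", "Q142", "Q12191", limit=10))
```

Calling `run_query(...)` on that query string should return the matching people as structured JSON, which is the whole point: the answer arrives as data, not as pages to read.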

1.5 Bn+ semantic triples by mid-2024

24/7 public access via SPARQL

500+ languages of available data

So why are large language models (LLMs), like those from OpenAI or Google, so fond of it? To build high-performing AI, you need knowledge, and especially high-quality knowledge. Reliable, structured data helps limit what are called “hallucinations”, the moments when an AI invents facts in its answers.

The internet is overflowing with data, but its reliability varies. A Wikipedia entry is considered more solid than a forum post. Wikidata pushes this logic to its extreme: the data there is not only reliable but already structured and linked. For an LLM, it’s like going from a pile of books to a perfectly organized library, with an index card for each work. The difference is striking.

✅ Positive Points for AI

Reduced Hallucinations: Structured and verified data minimizes the risk of factual errors by AI.

Anchoring and Training: A reliable foundation for training language models and grounding them in reality.

Direct Queries: AI agents can query Wikidata in real-time via SPARQL for verified facts.

⚠️ Points to Consider

Eligibility Criteria: While more flexible than Wikipedia, they remain strict on source verifiability.

Risk of Self-Promotion: Creating personal items is discouraged and may lead to deletion.

Reliance on Contributors: Quality and completeness depend on the community and automated imports.

The Guardians of Structured Knowledge

LLMs’ enthusiasm for Wikidata is explained by three major uses. First, the database feeds Google’s Knowledge Graph, which powers the information boxes that appear to the right of search results. Second, it is among the most reused open knowledge bases for training and refining language models. Finally, AI agents can query it directly, retrieving verified facts instead of reconstructing them from memory. A real boon for “Generative Engine Optimization” (GEO).
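The third use, an agent retrieving facts instead of recalling them, can be sketched in a few lines of Python. The example assumes the response follows the standard SPARQL JSON results layout (`results` → `bindings`, each cell carrying a `value`). The `SAMPLE` payload is hand-written for illustration, not real Wikidata output, `Q_EXAMPLE` is a hypothetical placeholder identifier, and the function names are invented here.

```python
# Hand-written sample in the SPARQL JSON results layout. It is NOT real
# Wikidata output; "Q_EXAMPLE" is a hypothetical placeholder identifier.
SAMPLE = {
    "head": {"vars": ["person", "personLabel"]},
    "results": {
        "bindings": [
            {
                "person": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q_EXAMPLE",
                },
                "personLabel": {"type": "literal", "value": "Example Author"},
            }
        ]
    },
}


def bindings_to_facts(result: dict) -> list[dict]:
    """Flatten each result row into a plain {variable: value} dictionary."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in result["results"]["bindings"]
    ]


def grounding_context(facts: list[dict], uri_key: str = "person",
                      label_key: str = "personLabel") -> str:
    """Format facts as a text block an agent could prepend to its prompt."""
    lines = []
    for fact in facts:
        # The identifier is the last path segment of the entity URI.
        qid = fact[uri_key].rsplit("/", 1)[-1]
        lines.append(f"- {fact[label_key]} ({qid})")
    return "Verified facts from Wikidata:\n" + "\n".join(lines)


if __name__ == "__main__":
    print(grounding_context(bindings_to_facts(SAMPLE)))
```

Keeping the identifier next to each label lets the agent, or a human reviewer, trace every statement in the answer back to a specific Wikidata item.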

But beware: even though Wikidata is more open than Wikipedia, it is not a rule-free directory. You can’t just create your own entry there. The eligibility criteria, though flexible, require “verifiable existence” rather than mere “celebrity”. For an item to be accepted, it must meet at least one of three criteria: have a valid link to a Wikimedia project (a Wikipedia page, for example), designate an identifiable entity described by reliable and accessible sources, or fill a structural need that makes other statements more useful.

And that’s where many hit a roadblock. Self-promotion is clearly discouraged. Attempting to create one’s own item constitutes a conflict of interest, and without solid independent sources, an item is destined for deletion. The experience of the source article’s author confirms this: his attempts to create entries for himself and his blog were rejected for lack of “notability” proven by third-party sources.

⚠️

Beware of Self-Promotion!

Wikidata strongly discourages the creation of entries by the entity concerned itself. Administrators look for proof of existence and relevance from independent sources, such as a BnF or VIAF record, rather than self-generated profiles.

The lesson is clear: on Wikidata, notability is not declared; it is observed. It must be validated by external traces that one does not control oneself. An ORCID identifier that one fills out oneself carries far less weight than a reference in a national library, for example.

“The real lesson of Wikidata? Recognition doesn’t come from what one says about oneself, but from what others attest, through independent sources, about our existence and relevance.”

An Anonymous Contributor, Structured Data Editor

In five years, AI will be even thirstier for structured data. We can expect an explosion of contributions and tools to feed databases like Wikidata. Language models will increasingly move away from being “stochastic parrots” that reproduce patterns and toward agents capable of reasoning from verified facts. The battle against “fake news” and unsubstantiated generated content will also depend on this kind of structured, verifiable data.

The future of reliable AI will depend on our collective ability to structure global knowledge. Wikidata, in this regard, is not just a database; it is a cornerstone for artificial intelligence that is fairer, more transparent, and above all, knows what it’s talking about.

About Rigaud Mickaël

Passionate about tech and a Linux enthusiast, I decipher AI with a unique and intense vision to make it useful to all, between robots, rock and the geek universe.
