less_retarded_wiki/wikidata.md

# Wikidata

Wikidata is a large collaborative [Internet](internet.md) [project](project.md) (a sister project of [Wikipedia](wikipedia.md), hosted by Wikimedia Foundation) building a massively huge noncommercial [public domain](public_domain.md) [database](database.md) of [information](information.md) about everything in existence. Well, not literally everything -- there are some rules about what can be included that are similar to those on [Wikipedia](wikipedia.md), e.g. notability (you can't add yourself unless you're notable enough, of course you can't add illegal data etc.). Wikidata records data in a form of so called [knowledge graph](knowledge_graph.md), i.e. it connects items and their properties with statements such as "Earth:location:inner Solar System", creating a mathematical structure called a [graph](graph.md). The whole database is available to anyone for any purpose without any conditions, under [CC0](cc0.md)!

It should be noted that Wikidata is incredibly useful but a bit unfairly overlooked in the shadow of its giant sibling Wikipedia, even though it offers a way to easily obtain large, absolutely [free](free_culture.md) and public domain data sets about anything. The database can be queried with specialized languages so one can obtain let's say coordinates of all terrorist attacks that happened in certain time period, a list of famous male cats, visualize the tree of biological species, list Jews who run restaurants in Asia or any other crazy thing. Wikidata oftentimes contains extra information that's not present in the Wikipedia article about the item and that's not even quickly found by [googling](google.md), and the information is at times also backed by sources just like on Wikipedia, so it's nice to always check Wikidata when researching anything.

Wikidata was opened on 30 October 2012. The first data that were stored were links between different language versions of Wikipedia articles, later Wikipedia started to use Wikidata to store information to display in infoboxes in articles and so Wikidata grew and eventually became a database of its own. As of 2022 there is a little over 100 million items, over 1 billion statements and over 20000 active users. The database dump in [json](json.md), [COMPRESSED](compression.md) with gzip, takes gargantuous 130 GB.

The first items added to the database were the [Universe](universe.md), [Earth](earth.md), [life](life.md), [death](death.md), [human](people.md) etc. Some cool items include [nigger](nigger.md) (Q1455718), fuck her right in the pussy (Q105676108), fart (Q5436447), [LMAO](lmao.md) (Q103319444), [Anarch](anarch.md) (Q114540914) and [this very wiki](lrs_wiki.md) (Q116266837). The structure of the database actually suggests that apart from the obvious usefulness of the data itself we may also toy around with this stuff in other [fun](fun.md) ways, for example we can use wikidata to give a hint of significance of any thing or concept -- given that two similar things predate wikidata itself, we may assume that the one with lower number is likely more significant for having been added earlier. For instance a [dog](dog.md)'s serial number is 144 and [cat](cat.md)'s is 146, so a dog would "win" this kind of internet battle by a tiny margin. Alternatively we can compare the size of the items' records to decide which one wins in significance. Here dog wins again with 200 kilobytes versus cat's 196 kilobytes.

## Database Structure

The database is a [knowledge graph](knowledge_graph.md). It stores the following kinds of records:

- **entities**: Specific "things", concrete or abstract, that exist and are stored in the database. Each one has a unique [ID](id.md), name (not necessarily unique), description and optional aliases (alternative names).
  - **items**: Objects of the real world, their ID is a number prepended with the letter *Q*, e.g. *[dog](dog.md)* (Q144), *[Earth](earth.md)* (Q2), *idea* (QQ131841) or *[Holocaust](holocaust.md)* (Q2763).
  - **properties**: Attributes that items may possess, their ID is a number prepended with the letter *P*, e.g. *instance of* (P31), *mass* (P2067) or *image* (P18). Properties may have constraints (created via statements), for example on values they may take.
- **statements**: Information about items and properties which may possibly link items/properties (entities) with other items/properties. One statement is so called triplet, it contains a subject (item/property), verb (property) and object (value, e.g. item/property, number, string, ...). I.e. a statement is a record of form *entity:property:value*, for example *dog(Q144):subclass of(P279):domestic mammal(Q57814795)*. Statements may link one property with multiple values (by having multiple statements about an entity with the same property), for example a man may have multiple nationalities etc. Statements may also optionally include *qualifiers* that further specify details about the statement, for example specifying the source of the data.

The most important properties are probably **instance of** (P31) and **subclass of** (P279) which put items into [sets](set.md)/classes and establish subsets/subclasses. The *instance of* attribute says that the item is an individual manifestation of a certain class (just like in [OOP](oop.md)), we can usually substitute is with the word "is", for example Blondi (Q155695, [Hitler](hitler.md)'s dog) is an instance of dog (Q144); note that an item can be an instance of multiple classes at the same time. The *subclass of* attribute says that a certain class is a subclass of another, e.g. dog (Q144) is a subclass of pet (Q39201) which is further a subclass of domestic animal (Q622852) etc. Also note that an item can be both an instance and a class.

## How To

Many [libraries](library.md)/[APIs](api.md)/tools exist for accessing wikidata because, unlike shitty [corporations](corporation.md) who guard and obfuscate their data by force, wikidata provides data in friendly ways -- you can even download the whole database dump in several formats including simple ones such as [JSON](json.md) (about 100 GB).

Arguably the easiest way to grab some smaller data is through the online query interface (https://query.wikidata.org/), entering a query (in [SPARQL](sparql.md) language, similar to [SQL](sql.md)) and then clicking download data -- you can choose several formats, e.g. [JSON](json.md) or [CSV](csv.md). That can then be processed further with whatever language or tool, be it [Python](python.md), [LibreOffice](libreoffice.md) Calc etc.

**BEWARE**: the query you enter may easily take a long time to execute and time out, you need to write it nicely which for more complex queries may be difficult if you're not familiar with SPARQL. However wikidata offers online tips on [optimization](optimization.md) of queries and there are many examples right in the online interface which you can just modify to suit you. Putting a limit on the number of results usually helps, also try to reorder the conditions and so on.

Now finally on to a few actual examples. The first one will show one of the most basic and common queries: just listing items with certain properties, specifically video [games](game.md) of the [FPS](fps.md) genre here:

```
SELECT ?item ?itemLabel ?itemDescription WHERE 
{
  ?item wdt:P31 wd:Q7889.    # item is a video game and
  ?item wdt:P136 wd:Q185029. # item is FPS
  
  # this gets item labels (you can append "Label" or "Description" to any requested variable now):
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100 # limit to 100 results, make the query faster
```

The language is somewhat intuitive, you basically enter conditions and the database then searches for records that satisfy them, but if it looks hard just see some tutorial.

OK, how about some lulz now? Let's search for human [races](race.md), then count them and compute their average, minimum and maximum height:

```
SELECT ?race ?raceLabel ?raceDescription (COUNT(?human) AS ?count) (AVG(?height) AS ?averageHeight) (MAX(?height) AS ?maxHeight) (MIN(?height) AS ?minHeight) WHERE
{
  { # subquery for optimization (delaying label retrieval)
    SELECT ?human ?race ?height WHERE
    {
      ?human wdt:P31 wd:Q5.  # is human
      ?human wdt:P172 ?race. # has race

      # has height in centimetres:
      ?human     p:P2048                  ?st1.
      ?st1       psv:P2048                ?vn1.
      ?vn1       wikibase:quantityAmount  ?height.
      ?vn1       wikibase:quantityUnit    wd:Q174728.
    } LIMIT 10000
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
} GROUP BY ?race ?raceLabel ?raceDescription ORDER BY DESC(?count)
```

Current this returned 331 races, the most frequent (in the database) being "[African American](nigger.md)" with average height 181 cm, then White Americans (171 cm), White People (167 cm) etc. Now let's shit on [privacy](privacy.md) and make an [NSA](nsa.md) style database or people along with personal data such as their names, birth and death dates, causes of death etc.:

```
SELECT ?human ?humanLabel ?humanDescription ?sexLabel ?birthDate ?birthPlaceLabel ?deathDate ?deathCauseLabel
  WITH
  {
    SELECT ?human ?birthDate ?birthPlace ?sex ?deathDate ?deathCause WHERE
    {
      ?human wdt:P31 wd:Q5.
      ?human wdt:P569 ?birthDate.
      ?human wdt:P19 ?birthPlace.
      ?human wdt:P21 ?sex.
      OPTIONAL { ?human wdt:P570 ?deathDate. }
      OPTIONAL { ?human wdt:P509 ?deathCause. }
      FILTER (?birthDate >= "1000-01-01T00:00:00Z"^^xsd:dateTime)
    } LIMIT 10000
  } AS %data
  WHERE
  {
    INCLUDE %data
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en" .
  }
}
```

Cool, works pretty nicely. Another [interesting](interesting.md) query may be one about [languages](human_language.md), counting their grammatical cases, tenses etc.:

```
SELECT ?languageLabel ?nativeName ?typeLabel ?countryLabel ?writingLabel ?code1 ?code2 ?speakers ?cases ?tenses ?genders WHERE
{
  {
    SELECT
      ?language
      (MAX(?nn) AS ?nativeName)
      (MAX(?ws) AS ?writing) 
      (MAX(?sp) AS ?speakers)
      (MAX(?c1) AS ?code1)
      (MAX(?c2) AS ?code2)
      (MAX(?ty) AS ?type)
      (MAX(?co) AS ?country)
      ?cases
      ?tenses
      ?genders
    WHERE 
    {
      { ?language wdt:P31 wd:Q33742. } UNION { ?language wdt:P31 wd:Q20162172. } UNION { ?language wdt:P31 wd:Q33215. } # is one of these
      OPTIONAL{?language wdt:P1098 ?sp. }
      OPTIONAL{?language wdt:P1705 ?nn.}
      OPTIONAL{?language wdt:P282 ?ws.}
      OPTIONAL{?language wdt:P218 ?c1.}
      OPTIONAL{?language wdt:P219 ?c2.}
      OPTIONAL{?language wdt:P279 ?ty.}
      OPTIONAL{?language wdt:P2341 ?co.}
      OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?cases) WHERE { ?language wdt:P2989 ?tmp. } GROUP BY ?language }
      OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?tenses) WHERE { ?language wdt:P3103 ?tmp. } GROUP BY ?language }
      OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?genders) WHERE { ?language wdt:P5109 ?tmp. } GROUP BY ?language }
    } GROUP BY ?language ?cases ?tenses ?genders ?countries
   }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". } # here assign label               
} ORDER BY ?speakers
```

This currently returns 1309 languages, French with most tenses (21) and Hungarian with most cases (24).

## See Also

- [encyclopedia](encyclopedia.md)
- [database](database.md)
Update 2022-10-27 23:45:55 +02:00			`# Wikidata`

Update 2025-07-13 17:57:29 +02:00			Wikidata is a large collaborative [Internet](internet.md) [project](project.md) (a sister project of [Wikipedia](wikipedia.md), hosted by Wikimedia Foundation) building a massively huge noncommercial [public domain](public_domain.md) [database](database.md) of [information](information.md) about everything in existence. Well, not literally everything -- there are some rules about what can be included that are similar to those on [Wikipedia](wikipedia.md), e.g. notability (you can't add yourself unless you're notable enough, of course you can't add illegal data etc.). Wikidata records data in a form of so called [knowledge graph](knowledge_graph.md), i.e. it connects items and their properties with statements such as "Earth:location:inner Solar System", creating a mathematical structure called a [graph](graph.md). The whole database is available to anyone for any purpose without any conditions, under [CC0](cc0.md)!
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-13 17:57:29 +02:00			It should be noted that Wikidata is incredibly useful but a bit unfairly overlooked in the shadow of its giant sibling Wikipedia, even though it offers a way to easily obtain large, absolutely [free](free_culture.md) and public domain data sets about anything. The database can be queried with specialized languages so one can obtain let's say coordinates of all terrorist attacks that happened in certain time period, a list of famous male cats, visualize the tree of biological species, list Jews who run restaurants in Asia or any other crazy thing. Wikidata oftentimes contains extra information that's not present in the Wikipedia article about the item and that's not even quickly found by [googling](google.md), and the information is at times also backed by sources just like on Wikipedia, so it's nice to always check Wikidata when researching anything.
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-13 17:57:29 +02:00			Wikidata was opened on 30 October 2012. The first data that were stored were links between different language versions of Wikipedia articles, later Wikipedia started to use Wikidata to store information to display in infoboxes in articles and so Wikidata grew and eventually became a database of its own. As of 2022 there is a little over 100 million items, over 1 billion statements and over 20000 active users. The database dump in [json](json.md), [COMPRESSED](compression.md) with gzip, takes gargantuous 130 GB.

Update 2025-07-18 14:17:08 +02:00			The first items added to the database were the [Universe](universe.md), [Earth](earth.md), [life](life.md), [death](death.md), [human](people.md) etc. Some cool items include [nigger](nigger.md) (Q1455718), fuck her right in the pussy (Q105676108), fart (Q5436447), [LMAO](lmao.md) (Q103319444), [Anarch](anarch.md) (Q114540914) and [this very wiki](lrs_wiki.md) (Q116266837). The structure of the database actually suggests that apart from the obvious usefulness of the data itself we may also toy around with this stuff in other [fun](fun.md) ways, for example we can use wikidata to give a hint of significance of any thing or concept -- given that two similar things predate wikidata itself, we may assume that the one with lower number is likely more significant for having been added earlier. For instance a [dog](dog.md)'s serial number is 144 and [cat](cat.md)'s is 146, so a dog would "win" this kind of internet battle by a tiny margin. Alternatively we can compare the size of the items' records to decide which one wins in significance. Here dog wins again with 200 kilobytes versus cat's 196 kilobytes.
Update 2022-10-27 23:45:55 +02:00
			`## Database Structure`

			`The database is a [knowledge graph](knowledge_graph.md). It stores the following kinds of records:`

			`- entities: Specific "things", concrete or abstract, that exist and are stored in the database. Each one has a unique [ID](id.md), name (not necessarily unique), description and optional aliases (alternative names).`
			`- items: Objects of the real world, their ID is a number prepended with the letter Q, e.g. [dog](dog.md) (Q144), [Earth](earth.md) (Q2), idea (QQ131841) or [Holocaust](holocaust.md) (Q2763).`
			`- properties: Attributes that items may possess, their ID is a number prepended with the letter P, e.g. instance of (P31), mass (P2067) or image (P18). Properties may have constraints (created via statements), for example on values they may take.`
			- statements: Information about items and properties which may possibly link items/properties (entities) with other items/properties. One statement is so called triplet, it contains a subject (item/property), verb (property) and object (value, e.g. item/property, number, string, ...). I.e. a statement is a record of form entity:property:value, for example dog(Q144):subclass of(P279):domestic mammal(Q57814795). Statements may link one property with multiple values (by having multiple statements about an entity with the same property), for example a man may have multiple nationalities etc. Statements may also optionally include qualifiers that further specify details about the statement, for example specifying the source of the data.

			The most important properties are probably instance of (P31) and subclass of (P279) which put items into [sets](set.md)/classes and establish subsets/subclasses. The instance of attribute says that the item is an individual manifestation of a certain class (just like in [OOP](oop.md)), we can usually substitute is with the word "is", for example Blondi (Q155695, [Hitler](hitler.md)'s dog) is an instance of dog (Q144); note that an item can be an instance of multiple classes at the same time. The subclass of attribute says that a certain class is a subclass of another, e.g. dog (Q144) is a subclass of pet (Q39201) which is further a subclass of domestic animal (Q622852) etc. Also note that an item can be both an instance and a class.

			`## How To`

Update 2025-07-18 14:17:08 +02:00			`Many [libraries](library.md)/[APIs](api.md)/tools exist for accessing wikidata because, unlike shitty [corporations](corporation.md) who guard and obfuscate their data by force, wikidata provides data in friendly ways -- you can even download the whole database dump in several formats including simple ones such as [JSON](json.md) (about 100 GB).`
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-18 14:17:08 +02:00			`Arguably the easiest way to grab some smaller data is through the online query interface (https://query.wikidata.org/), entering a query (in [SPARQL](sparql.md) language, similar to [SQL](sql.md)) and then clicking download data -- you can choose several formats, e.g. [JSON](json.md) or [CSV](csv.md). That can then be processed further with whatever language or tool, be it [Python](python.md), [LibreOffice](libreoffice.md) Calc etc.`
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-18 14:17:08 +02:00			`BEWARE: the query you enter may easily take a long time to execute and time out, you need to write it nicely which for more complex queries may be difficult if you're not familiar with SPARQL. However wikidata offers online tips on [optimization](optimization.md) of queries and there are many examples right in the online interface which you can just modify to suit you. Putting a limit on the number of results usually helps, also try to reorder the conditions and so on.`
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-18 14:17:08 +02:00			`Now finally on to a few actual examples. The first one will show one of the most basic and common queries: just listing items with certain properties, specifically video [games](game.md) of the [FPS](fps.md) genre here:`
Update 2022-10-27 23:45:55 +02:00
			```
Update 2025-07-18 14:17:08 +02:00			`SELECT ?item ?itemLabel ?itemDescription WHERE`
Update 2022-10-27 23:45:55 +02:00			`{`
Update 2025-07-18 14:17:08 +02:00			`?item wdt:P31 wd:Q7889. # item is a video game and`
Update 2022-10-27 23:45:55 +02:00			`?item wdt:P136 wd:Q185029. # item is FPS`

Update 2025-07-18 14:17:08 +02:00			`# this gets item labels (you can append "Label" or "Description" to any requested variable now):`
Update 2022-10-27 23:45:55 +02:00			`SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }`
			`}`
			`LIMIT 100 # limit to 100 results, make the query faster`
			```

Update 2025-07-18 14:17:08 +02:00			`The language is somewhat intuitive, you basically enter conditions and the database then searches for records that satisfy them, but if it looks hard just see some tutorial.`

			`OK, how about some lulz now? Let's search for human [races](race.md), then count them and compute their average, minimum and maximum height:`
Update 2022-10-27 23:45:55 +02:00
			```
Update 2025-07-18 14:17:08 +02:00			`SELECT ?race ?raceLabel ?raceDescription (COUNT(?human) AS ?count) (AVG(?height) AS ?averageHeight) (MAX(?height) AS ?maxHeight) (MIN(?height) AS ?minHeight) WHERE`
Update 2022-10-27 23:45:55 +02:00			`{`
Update 2025-07-18 14:17:08 +02:00			`{ # subquery for optimization (delaying label retrieval)`
			`SELECT ?human ?race ?height WHERE`
			`{`
			`?human wdt:P31 wd:Q5. # is human`
			`?human wdt:P172 ?race. # has race`

			`# has height in centimetres:`
			`?human p:P2048 ?st1.`
			`?st1 psv:P2048 ?vn1.`
			`?vn1 wikibase:quantityAmount ?height.`
			`?vn1 wikibase:quantityUnit wd:Q174728.`
			`} LIMIT 10000`
			`}`

			`SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }`
			`} GROUP BY ?race ?raceLabel ?raceDescription ORDER BY DESC(?count)`
			```
Update 2022-10-27 23:45:55 +02:00
Update 2025-07-18 14:17:08 +02:00			`Current this returned 331 races, the most frequent (in the database) being "[African American](nigger.md)" with average height 181 cm, then White Americans (171 cm), White People (167 cm) etc. Now let's shit on [privacy](privacy.md) and make an [NSA](nsa.md) style database or people along with personal data such as their names, birth and death dates, causes of death etc.:`

			```
			`SELECT ?human ?humanLabel ?humanDescription ?sexLabel ?birthDate ?birthPlaceLabel ?deathDate ?deathCauseLabel`
			`WITH`
			`{`
			`SELECT ?human ?birthDate ?birthPlace ?sex ?deathDate ?deathCause WHERE`
			`{`
			`?human wdt:P31 wd:Q5.`
			`?human wdt:P569 ?birthDate.`
			`?human wdt:P19 ?birthPlace.`
			`?human wdt:P21 ?sex.`
			`OPTIONAL { ?human wdt:P570 ?deathDate. }`
			`OPTIONAL { ?human wdt:P509 ?deathCause. }`
			`FILTER (?birthDate >= "1000-01-01T00:00:00Z"^^xsd:dateTime)`
			`} LIMIT 10000`
			`} AS %data`
			`WHERE`
			`{`
			`INCLUDE %data`
			`SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en" .`
			`}`
Update 2022-10-27 23:45:55 +02:00			`}`
Update 2025-07-18 14:17:08 +02:00			```

			`Cool, works pretty nicely. Another [interesting](interesting.md) query may be one about [languages](human_language.md), counting their grammatical cases, tenses etc.:`

			```
			`SELECT ?languageLabel ?nativeName ?typeLabel ?countryLabel ?writingLabel ?code1 ?code2 ?speakers ?cases ?tenses ?genders WHERE`
			`{`
			`{`
			`SELECT`
			`?language`
			`(MAX(?nn) AS ?nativeName)`
			`(MAX(?ws) AS ?writing)`
			`(MAX(?sp) AS ?speakers)`
			`(MAX(?c1) AS ?code1)`
			`(MAX(?c2) AS ?code2)`
			`(MAX(?ty) AS ?type)`
			`(MAX(?co) AS ?country)`
			`?cases`
			`?tenses`
			`?genders`
			`WHERE`
			`{`
			`{ ?language wdt:P31 wd:Q33742. } UNION { ?language wdt:P31 wd:Q20162172. } UNION { ?language wdt:P31 wd:Q33215. } # is one of these`
			`OPTIONAL{?language wdt:P1098 ?sp. }`
			`OPTIONAL{?language wdt:P1705 ?nn.}`
			`OPTIONAL{?language wdt:P282 ?ws.}`
			`OPTIONAL{?language wdt:P218 ?c1.}`
			`OPTIONAL{?language wdt:P219 ?c2.}`
			`OPTIONAL{?language wdt:P279 ?ty.}`
			`OPTIONAL{?language wdt:P2341 ?co.}`
			`OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?cases) WHERE { ?language wdt:P2989 ?tmp. } GROUP BY ?language }`
			`OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?tenses) WHERE { ?language wdt:P3103 ?tmp. } GROUP BY ?language }`
			`OPTIONAL{ SELECT ?language (COUNT(?tmp) AS ?genders) WHERE { ?language wdt:P5109 ?tmp. } GROUP BY ?language }`
			`} GROUP BY ?language ?cases ?tenses ?genders ?countries`
			`}`
			`SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". } # here assign label`
			`} ORDER BY ?speakers`
			```

			`This currently returns 1309 languages, French with most tenses (21) and Hungarian with most cases (24).`

			`## See Also`

			`- [encyclopedia](encyclopedia.md)`
			`- [database](database.md)`