lingvoj.org

Resources for the multilingual Semantic Web

What's up?

2010-05-10 : lingvoj.org meets Lexvo.org

Since the launch of lingvoj.org in 2007, the linked data cloud has grown at a steady pace, and a growing number of URI sets have been published to identify human languages. Lexvo.org provides the most exhaustive of those so far, in which URIs for languages are integrated in a global approach to terminology. Following exchanges with Gerard de Melo, editor at Lexvo.org, it has been decided to deprecate lingvoj.org URIs for individual languages and to redirect them to the more stable and exhaustive publication at Lexvo.org.

As of today, most lingvoj.org URIs for individual languages are redirected to Lexvo.org URIs through content negotiation. The few exceptions are URIs of languages with no ISO 639-3 code, since Lexvo.org URIs are built on those codes, and URIs of languages with a regional tag, such as en-us.

The lingvo-to-lexvo RDF file provides the mappings and equivalences between lingvoj.org and Lexvo.org URIs. Applications using lingvoj.org URIs are invited to update their references accordingly, although the redirection mechanism should keep such applications from breaking.
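The redirection rule described above can be pictured with a minimal Python sketch. It assumes that Lexvo.org language URIs follow the pattern http://lexvo.org/id/iso639-3/ followed by the ISO 639-3 code, and it uses a tiny hand-written sample table; the authoritative mappings are those in the RDF file. The names LEXVO_BASE, SAMPLE_ISO639_3 and lexvo_uri_for are illustrative only.

```python
from typing import Optional

LINGVOJ_BASE = "http://www.lingvoj.org/lang/"
# Assumption: Lexvo.org builds its language URIs on ISO 639-3 codes like this.
LEXVO_BASE = "http://lexvo.org/id/iso639-3/"

# Hypothetical three-entry sample; the real mappings live in the RDF file.
SAMPLE_ISO639_3 = {"en": "eng", "fr": "fra", "zh": "zho"}


def lexvo_uri_for(lingvoj_uri: str) -> Optional[str]:
    """Return the Lexvo.org URI for a lingvoj.org language URI, or None."""
    if not lingvoj_uri.startswith(LINGVOJ_BASE):
        return None
    code = lingvoj_uri[len(LINGVOJ_BASE):].lower()
    if "-" in code:
        # Codes with a regional tag (for example en-us) are not redirected.
        return None
    # Languages without a known ISO 639-3 code are not redirected either.
    iso3 = SAMPLE_ISO639_3.get(code)
    return LEXVO_BASE + iso3 if iso3 else None
```

In this sketch, http://www.lingvoj.org/lang/zh maps to a Lexvo.org URI, whereas http://www.lingvoj.org/lang/en-us yields None and would keep being served by lingvoj.org.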

What's in there?

  • Each individual language is identified by a URI in the namespace http://www.lingvoj.org/lang/. The last path segment of the URI is a language code conforming to BCP 47.
    The language code is the two-letter code defined by ISO 639-1 when available, or else the three-letter ISO 639-2 code, or the ISO 639-3 code when neither of the previous ones exists. It can also include regional tags, for example en-us or en-gb. Such codes are used as values of the "xml:lang" attribute, and also as the prefix of the Wikipedia in this language.
    For example, "zh" is the code for the Chinese language. Therefore this language is identified by http://www.lingvoj.org/lang/zh .
    Content negotiation is used to redirect this URI either to a human-readable HTML page http://www.lingvoj.org/lingvo/zh.html , or to an RDF document containing the formal description http://www.lingvoj.org/lingvo/zh.rdf .
  • The Lingvoj Ontology is used, declaring a "Lingvo" Class, its attributes such as ISO 639 codes and the way to link languages to FOAF resources (properties having Lingvo class as rdfs:range). Some examples of the use of those properties are available in this FOAF profile.
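The code selection rule and the content negotiation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the site's implementation: the function names are hypothetical, and the Accept values are the standard media types for RDF/XML and HTML. The request is only built, not sent.

```python
import urllib.request
from typing import Optional

LANG_NAMESPACE = "http://www.lingvoj.org/lang/"


def pick_code(iso639_1: Optional[str], iso639_2: Optional[str],
              iso639_3: Optional[str], region: Optional[str] = None) -> str:
    """Prefer the ISO 639-1 code, then ISO 639-2, then ISO 639-3."""
    code = iso639_1 or iso639_2 or iso639_3
    if code is None:
        raise ValueError("at least one ISO 639 code is required")
    # An optional regional tag is appended, as in en-us or en-gb.
    return f"{code}-{region}" if region else code


def language_uri(code: str) -> str:
    """Build the language URI in the lingvoj.org namespace."""
    return LANG_NAMESPACE + code


def negotiated_request(code: str, want_rdf: bool) -> urllib.request.Request:
    """Build (without sending) a request asking for RDF/XML or HTML."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return urllib.request.Request(language_uri(code), headers={"Accept": accept})
```

Passing the result of negotiated_request("zh", want_rdf=True) to urllib.request.urlopen would let the server choose the representation; which document is actually returned depends on the redirections in place at the time of the request.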

Disclaimer
Resources provided here have no official status, but URIs in the lingvoj namespaces are intended to remain "cool", which means stable and dereferenceable. ISO codes have been cross-checked carefully. Beyond that, the quality of information for each resource may vary. Labels are subject to change with data quality checking. Alternate URIs linking to other data sets may not always be available and dereferenceable.

Contact : Bernard Vatant

History

2009-08-14 : Added a voiD description of the lingvoj Data Set

2009-04-06 : New version of the ontology
Version 1.3 introduces the use of dcterms:language, as a superproperty of various lingvoj object properties, and its inverse property "is language of", used to link to active Wikipedia in the language when available (265 such languages to date).

2009-04-02 : Links to and from other data sets in the Linked Data Cloud

  • Lingvoj.org URIs are used by the Linked Movie Data Base for languages of movies. Over 28,000 links to 82 different languages.
  • Musicbrainz language URIs are now linked to and from lingvoj.org URIs.
  • Hugh Glaser and Ian Millard from the RKB Explorer initiative team provided hundreds of alternate URIs mapped to lingvoj.org URIs, including UMBEL and WordNet URIs.

2009-04-01 : Long overdue new release!
Updated Wikipedia languages and labels. Some stats : 522 languages, 1,585 values of ISO 639 codes, 1,452 alternate URIs, and 16,950 labels in 251 different languages. All languages have at least an English label. Most represented languages for multilingual labels are French (549), German (457), Russian (419) and Spanish (407) ... more.
Fixed CYC URIs, which had been broken for a while, and added Freebase URIs. Both are based on DBpedia owl:sameAs declarations.
Simplified the stylesheet, removing the (questionable and incomplete) depictions. The stylesheet is now called from inside the individual RDF files; both IE and Firefox seem happy with it.

2008-01-28 : Release of Lingvoj Ontology v1.2, declaring the "Lingvo" class as subclass of "LinguisticSystem" as defined by the new release of Dublin Core terms in RDFS.

2007-11-29 : Release of Lingvoj Ontology v1.1, including the Translation class, which makes it possible to declare facts such as : The resource A in original language L1 has been translated into resource B in target language L2, by the translator Z. Examples of use for translations of W3C Recommendations.

2007-10-09 : Finally, with the valuable help of the Linking Open Data community, achieved publication with proper content negotiation, which works well with Firefox. For some reason this content negotiation is not well supported by Internet Explorer.
Note that this results in new URIs for languages. URIs used in previous versions are no longer supported. Cool URIs never change, which means the previous ones were not cool, and the new ones should be stable from now on.

What does "lingvoj" mean?

"Lingvoj" means "Languages" in Esperanto. It's the plural form of "Lingvo".

Why do we need that?

Languages are an endangered heritage

According to Ethnologue, the number of human languages currently used in the world amounts to almost 7,000. About half of them are on the verge of extinction. Only a small fraction is supported by some writing system and has a written heritage, and among those, still fewer are used in modern information systems and on the Web. A good idea of the number of languages used on the Web is provided by the multilingual editions of Wikipedia, to date 265 different languages.
While the ranking of languages by the importance of their respective Wikipedias is a fairly good indicator of the Web influence of their communities of speakers, it is of course very different from the ranking by number of speakers. An interesting indicator for each language is the ratio of the number of articles to the number of speakers. For English, it is about 1 to 200, whereas for Hindi, it is about 1 to 30,000.

See also: The Wikipedia Challenge

We need languages as RDF resources

In current XML and RDF practice, languages are identified by tags, typically used in the "xml:lang" attribute. The allowed values of tags are defined by BCP 47. Those language tags are typically used for rdfs:label or rdfs:comment, and allow the filtering of such elements of description by language, for example in SPARQL queries. But they do not provide support for queries such as:

  • "Can I find native speakers of Bengali in Berlin?"
  • "Which books by Victor Hugo are translated into Arabic?"
  • "Is this software documented in Chinese?"

To answer such queries, languages need to be represented as resources, linked to other resources representing books, people, organizations, places, events, products ... through object properties. DBpedia provides some information of this kind, for example the countries of which Bengali is an official language. But more can be done, for example a simple add-on to FOAF defining properties that capture a person's level of proficiency in a language, as defined in Wikipedia:Babel.
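The difference between language tags and languages as resources can be illustrated with a small self-contained Python sketch over in-memory data. Everything under http://example.org/ (the "documents" property, the software and manual resources) is hypothetical and only serves the illustration; dcterms:language is the Dublin Core property, and the language URIs are those of the lingvoj.org namespace.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

ZH = "http://www.lingvoj.org/lang/zh"
AR = "http://www.lingvoj.org/lang/ar"
DCTERMS_LANGUAGE = "http://purl.org/dc/terms/language"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# With language tags alone, all one can do is filter literals by tag.
LABELS = [("Chinese", "en"), ("chinois", "fr")]


def labels_in(lang_tag: str) -> List[str]:
    """Return the labels carrying the given language tag."""
    return [text for text, tag in LABELS if tag == lang_tag]


# With languages as resources, other resources can link to them.
TRIPLES: List[Triple] = [
    (EX + "manual-1", EX + "documents", EX + "software-x"),
    (EX + "manual-1", DCTERMS_LANGUAGE, ZH),
    (EX + "manual-2", EX + "documents", EX + "software-y"),
    (EX + "manual-2", DCTERMS_LANGUAGE, AR),
]


def documented_in(software: str, language_uri: str,
                  triples: List[Triple]) -> bool:
    """Answer: is this software documented in the given language?"""
    docs = {s for s, p, o in triples
            if p == EX + "documents" and o == software}
    return any(s in docs and p == DCTERMS_LANGUAGE and o == language_uri
               for s, p, o in triples)
```

Here documented_in(EX + "software-x", ZH, TRIPLES) holds because a manual links both to the software and to the language resource, a question that filtering labels by tag cannot express.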

See also: Languages as RDF resources on the ESW Wiki.

Sources

Sources defining languages as RDF resources

Other sources

  • Ethnologue provides a description page for every language in the ISO 639-3 code list.
  • Multilingual Wikipedias, and their interwiki links, are the source used to find the labels of a language in other languages.
This material is Open Data