Internationalization
General information about the extraction framework is in the main documentation. The procedure is exactly the same but you will have to change some configuration files for better results. Most internationalization (I18n) configuration options are in the core module under org.dbpedia.extraction.config
Questions may be asked on the DBpedia developers list.
You are also encouraged to read the DBpedia I18n paper before proceeding further; all issues on this page are discussed there:
- DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia
- Internationalization of Linked Data: The case of the Greek DBpedia edition
- Publications about DBpedia
We encourage you to use xx.dbpedia.org as the namespace of your localized extraction of language xx. However, if you want, you can choose the generic domain name dbpedia.org instead of the default xx.dbpedia.org. The option is (for now) in the following file:
[extraction_framework/core/src/main/scala] org.dbpedia.extraction.util.Language.scala
// default: no language uses the generic domain
val generic = Set[String]()
// change to this if language xx should be extracted using the generic domain
val generic = Set("xx")
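The effect of this setting can be illustrated with a minimal Python sketch. The function name `base_uri` is hypothetical and this is not the framework's own code; it only mirrors the rule described above, assuming that a language listed in the generic set uses dbpedia.org and every other language uses its own subdomain.

```python
def base_uri(lang, generic=frozenset()):
    """Return the resource domain for a language code.

    `generic` plays the role of the `generic` set in Language.scala:
    languages listed there use the generic domain (assumption of this sketch).
    """
    if lang in generic:
        return "http://dbpedia.org"
    return f"http://{lang}.dbpedia.org"


# Default: no language uses the generic domain.
print(base_uri("el"))
# Language "el" configured to use the generic domain.
print(base_uri("el", generic={"el"}))
```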
A setting in dump/extract.properties selects whether resource identifiers are serialized as URIs, IRIs, or both. For example, with the following setting, files with the suffix iri.nt (containing IRIs in N-Triples format) and uri.nq (containing URIs in N-Quads format) are written.
formats=iri.nt,uri.nq
Currently, these format combinations are available: iri or uri, followed by a dot and one of nt (N-Triples), nq (N-Quads), ttl (Turtle), tql (Turtle Quads – N-Quads with Turtle encoding), triples.trix, quads.trix.
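To catch typos in this setting before starting a long extraction run, the combinations listed above can be checked with a small script. This is only a sketch: `parse_formats` is a hypothetical helper, not part of the extraction framework, and it assumes the setting is written on a single `formats=` line with comma-separated values.

```python
# Serializations and format suffixes named in the text above.
SERIALIZATIONS = ("iri", "uri")
SUFFIXES = ("nt", "nq", "ttl", "tql", "triples.trix", "quads.trix")

VALID_FORMATS = {f"{s}.{x}" for s in SERIALIZATIONS for x in SUFFIXES}


def parse_formats(line):
    """Parse a 'formats=...' line and return the list of format names.

    Raises ValueError if the key is not 'formats' or if a value is not
    one of the combinations listed above.
    """
    key, sep, value = line.strip().partition("=")
    if sep != "=" or key.strip() != "formats":
        raise ValueError(f"not a formats setting: {line!r}")
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [f for f in formats if f not in VALID_FORMATS]
    if unknown:
        raise ValueError(f"unknown format(s): {', '.join(unknown)}")
    return formats


print(parse_formats("formats=iri.nt,uri.nq"))
```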
org.dbpedia.extraction.config
Some extractors and parsers are language-sensitive, and you need to set up language-specific options for them to work:
- Disambiguation Extractor
- Homepage Extractor
- Image Extractor
- Infobox Extractor
- Inter Language Links Extractor
- Template Parameter Extractor
- Date Time Parser
- Duration
- Flag Template Parser
- Unit Value Parser
For a list of extractors that you should use for your language, see: http://mappings.dbpedia.org/index.php/DBpedia_datasets
In order to create links to the English DBpedia and to the LOD Cloud, you will have to run some scripts. Go to scripts/shell-script, and run the interwiki links, and then the interlinking scripts.
Interwiki creates owl:sameAs links between two Wikipedias / DBpedias. It uses the Inter Language Links Extractor, but it removes all one-way links. It has been shown that one-way links (<10%) are responsible for >90% of errors in article linking. (See the DBpedia I18n paper, Section 5.1.)
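The removal of one-way links can be sketched as follows. This is an illustration of the idea, not the interwiki script itself: `two_way_links` and the sample data are made up, and the sketch assumes each article has at most one link into the other language.

```python
def two_way_links(links_a, links_b):
    """Keep only the links that are confirmed in both directions.

    links_a maps article titles in wiki A to titles in wiki B;
    links_b maps article titles in wiki B to titles in wiki A.
    """
    return {
        src: dst
        for src, dst in links_a.items()
        if links_b.get(dst) == src
    }


# Made-up sample data: the third Greek article points to a page
# that does not link back, so it is dropped.
el_to_en = {
    "Αθήνα": "Athens",
    "Πάτρα": "Patras",
    "Σπάρτη": "Sparta (disambiguation)",
}
en_to_el = {"Athens": "Αθήνα", "Patras": "Πάτρα", "Sparta": "Σπάρτη"}

print(two_way_links(el_to_en, en_to_el))
```

Only the pairs that survive this filter would then be written out as owl:sameAs links.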
The interlinking script takes the owl:sameAs links (output of the previous script), downloads all the databases linking DBpedia to other data sources, and filters them down to the common triples.
You can use any triple store of your choice. So far, most i18n chapters use Virtuoso. Instructions on how to load your triples into a Virtuoso triple store are available at:
Step-by-step:
http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C
http://www.openlinksw.com/blog/~kidehen/?id=1654
How Do I?
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/#How%20Do%20I...
Guide by DBpedia Polish:
http://translate.google.com/translate?sl=pl&tl=en&js=n&prev=_t&hl=en&ie=...
General Virtuoso instructions:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFInsert
Example loading script (populate.sql):
https://github.com/dbpedia/dbpedia-vad-i18n
There is a script made by DBpedia Greece to clear and reload datasets:
https://github.com/dbpedia/dbpedia-vad-i18n/blob/master/populate.sql
After installing the Virtuoso server, execute the following statements to adjust the Virtuoso registry values:
registry_set ('dbp_decode_iri', 'off'); -- or 'on'
registry_set ('dbp_domain', 'http://dbpedia.org'); -- the resource namespace
registry_set ('dbp_graph', 'http://dbpedia.org'); -- the graph, usually the same as dbp_domain
registry_set ('dbp_lang', 'en'); -- the default language
registry_set ('dbp_DynamicLocal', 'on'); -- you have to set DynamicLocal to 1 when in the official domain
registry_set ('dbp_category', 'Category'); -- the Wikipedia translation for Category
registry_set ('dbp_imprint', 'http://wiki.dbpedia.org/Imprint');
registry_set ('dbp_website', 'http://wiki.dbpedia.org/');
registry_set ('dbp_lhost', ':80');
registry_set ('dbp_vhost', 'dbpedia.org');
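For a localized chapter, most of these values follow from the language code. The sketch below generates the statements for a language xx; it is not part of the VAD or of Virtuoso. The helper names are hypothetical, and it assumes that the chapter uses xx.dbpedia.org for the domain, graph, and virtual host. The dbp_imprint and dbp_website entries are chapter-specific and are left out.

```python
def sql_quote(value):
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def registry_statements(lang, category, decode_iri=True):
    """Return registry_set statements for a chapter at <lang>.dbpedia.org.

    Assumption of this sketch: domain, graph and vhost all use the
    language subdomain; adjust if your chapter uses the generic domain.
    """
    host = f"{lang}.dbpedia.org"
    values = {
        "dbp_decode_iri": "on" if decode_iri else "off",
        "dbp_domain": f"http://{host}",
        "dbp_graph": f"http://{host}",
        "dbp_lang": lang,
        "dbp_DynamicLocal": "on",
        "dbp_category": category,
        "dbp_lhost": ":80",
        "dbp_vhost": host,
    }
    return [
        f"registry_set ({sql_quote(key)}, {sql_quote(value)});"
        for key, value in values.items()
    ]


# Example for the Greek chapter; "Κατηγορία" is the Greek Wikipedia
# name of the Category namespace.
for statement in registry_statements("el", "Κατηγορία"):
    print(statement)
```

Review the generated statements before running them, since the right values depend on how your server is deployed.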
The Virtuoso VAD file is now forked at https://github.com/dbpedia/dbpedia-vad-i18n. Note that you must set the registry values above before installing this VAD.
The default DBpedia plug-in was changed to accept these variables as parameters. Note that the HTTP protocol only accepts URIs, so an encoding/decoding strategy was implemented to dereference IRIs (just set dbp_decode_iri to on). All TCN (Transparent Content Negotiation) rules have been implemented. (See the DBpedia I18n paper, Section 6.)
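The relationship between an IRI and the URI that is actually sent over HTTP can be illustrated in a few lines of Python. This is not the code used by the VAD; it is a sketch of the general percent-encoding idea using the standard library, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote

# Characters that already have a meaning in URI syntax and must not be
# escaped again ("%" is included so existing escapes are left alone).
URI_SYNTAX = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 bytes."""
    return quote(iri, safe=URI_SYNTAX)


def uri_to_iri(uri):
    """Decode percent-escapes back to Unicode characters.

    Simplification: this also decodes escaped reserved characters such
    as %2F, which a stricter implementation would leave untouched.
    """
    return unquote(uri)


iri = "http://el.dbpedia.org/resource/Αθήνα"
uri = iri_to_uri(iri)
print(uri)
print(uri_to_iri(uri))
```

A client that requests the encoded form and one that uses the IRI should therefore end up at the same resource.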
When the server is accessible through its official address, say xx.dbpedia.org, where xx stands for a language code, the virtual host must be declared. To do so, open the Conductor interface at http://xx.dbpedia.org/conductor, open the tab Web Application Server, then the subtab Virtual Domains & Directories. Define a host xx.dbpedia.org on port 80 and interface 0.0.0.0. If you have already installed the DBpedia VAD, you will need to uninstall and reinstall it so that this host is configured as well.
Download the DBpedia VAD and install it: vad_install('[path/to/dbpedia_dav.vad]', 0); (the file dbpedia_dav.vad must be stored in a folder listed in the entry DirsAllowed of the file virtuoso.ini).
Note: If you need to change the values, first uninstall the DBpedia VAD: vad_uninstall('dbpedia/[version]');
The latest version is 1.3.25; you can check which one you installed by entering vad_list_packages ();
To check the registry values, use select registry_get([entry_key]); for instance, select registry_get('dbp_website');
The Virtuoso server should now be configured properly and you should see something like this.
If you get an "empty" page with code 404 (check with curl -I ), you should probably set DynamicLocal = 0 in the [URIQA] section of your virtuoso.ini configuration file.
To 'resolve' namespaces like it.dbpedia.org/property to the dbprob prefix, add an entry in the Virtuoso Conductor under the Linked Data / Namespaces tab.
Some chapters keep different services in different machines, or everything in one machine but with software for content management (e.g. Drupal), project management (e.g. TRAC), etc. Some of us chose to use Apache to handle redirects to different services.
Here is an example virtual host configuration (make sure you have mod_rewrite enabled). In this example, 10.0.0.2 is the IP to the main server, and two other servers are located at dataserver.example.com and anothermachine.example.com.
ServerName pt.dbpedia.org
ServerAdmin [email protected]
DocumentRoot "/var/www/dbpedia"
ErrorLog "logs/pt.dbpedia.org-error.log"
CustomLog "logs/pt.dbpedia.org-access.log" common
# For CORS
Header set Access-Control-Allow-Origin "*"
RewriteEngine on
RewriteRule ^/$ http://10.0.0.2/ [P,L]
RewriteRule ^/download/(.*) http://dataserver.example.com/download/$1 [P,L]
RewriteRule ^/demo/(.*) http://anothermachine.example.com/demo/$1 [P,L]
RewriteRule ^/(.*) http://10.0.0.2/$1 [P,L]
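The order of the rewrite rules matters: each rule carries the L flag, so the first matching rule wins, and the catch-all rule has to come last. The following Python sketch simulates that first-match behavior for the four rules above. It is a simplification for illustration only, not how Apache evaluates rules internally, and `route` is a hypothetical helper.

```python
import re

# The four rules from the example configuration, in the same order.
RULES = [
    (r"^/$", "http://10.0.0.2/"),
    (r"^/download/(.*)", "http://dataserver.example.com/download/$1"),
    (r"^/demo/(.*)", "http://anothermachine.example.com/demo/$1"),
    (r"^/(.*)", "http://10.0.0.2/$1"),
]


def route(path):
    """Return the backend URL chosen by the first matching rule."""
    for pattern, target in RULES:
        match = re.search(pattern, path)
        if match:
            if match.groups():
                # Substitute the captured group for the $1 back-reference.
                target = target.replace("$1", match.group(1))
            return target
    return None


for path in ("/", "/download/dump.nt", "/demo/app", "/resource/Lisboa"):
    print(path, "->", route(path))
```

If the catch-all rule were listed first, every request would be sent to 10.0.0.2 and the download and demo rules would never be reached.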
RewriteRule may cause problems with URIs containing non-ASCII characters. In that case, make sure that mod_proxy is enabled and use ProxyPass instead. For example:
ProxyPass /download/ http://dataserver.example.com/download/
If you want to set up a new DBpedia chapter, you first have to download the extraction framework and look at the i18n-specific changes described in this page.
When you are done, you can set up your chapter. Please perform the following steps:
1. Add your URLs to the table of chapters: http://wiki.dbpedia.org/Internationalization/Chapters.
2. Add your name to the contacts: http://dbpedia.org/Internationalization.
3. Send dbpedia-developers a pull request with the modifications you've made to the code.
4. Add to the wiki any instructions that were missing when you started, including the things that you had to figure out yourself.
5. Set up a landing page acknowledging the mapping editors and other people that supported the creation of your chapter.
6. Give us the IPs for redirection/forwarding to your subdomain (e.g. es.dbpedia.org).
6.1 Send us the URL to your landing page.
6.2 Send us the URL to one example resource in your Linked Data deployment.
6.3 Send us the URL to your SPARQL endpoint.
7. Make sure you've spelled DBpedia correctly (it is not DBPedia or dbPedia, it is DBpedia).
8. We recommend naming your chapter DBpedia “Insert Language Name”. Examples: DBpedia Portuguese, DBpedia Italiano, DBpédia en français. Avoid using country names or nationalities, especially in cases where the language is spoken in multiple countries. Prefer less esoteric names: Italiano may sound better than Italophone.
9. Create (and update on every release) your entry on datahub.io. You can take complete examples from the English, Greek, or Dutch DBpedia. If you also want to be included in the LOD Cloud, use the following validator service: http://validator.lod-cloud.net/validate.php