Quality Issues
The vocabulary quality issues defined in the following sections should be applicable to any SKOS vocabulary. Some of the issues are taken from already existing research (see section "Related Work"). Others reflect our thoughts when investigating real-world thesauri (see the data repository). If not stated otherwise, we treat the SKOS vocabularies as fully entailed RDFS graphs. We also enrich the vocabularies by entailment of owl:inverseOf properties as well as instances of owl:TransitiveProperty and owl:SymmetricProperty.
Description | Some controlled vocabularies contain natural-language literals without stating which language is used. Language tags might also not conform to language tag standards such as RFC 3066. |
Example | The network ontology defines a prefLabel of the resource http://river.styx.org/network#AddressFamilies that has no language information. Furthermore, the press contacts information dataset (http://data.southampton.ac.uk/dataset/pressinfo.html) defines labels for resources without stating the language. |
Implementation | Iteration over all triples in the vocabulary whose predicate is rdfs:label or skos:note, or a subproperty of either. |
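The iteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples, literals are `(text, lang)` pairs, the predicate set `TEXT_PREDICATES` is a hand-picked stand-in for the entailed subproperties, and `LANG_TAG` checks only the basic RFC 3066 tag syntax (it does not look tags up in a registry).

```python
import re

# Assumption: a hand-picked stand-in for rdfs:label, skos:note, and their subproperties.
TEXT_PREDICATES = {"rdfs:label", "skos:prefLabel", "skos:altLabel",
                   "skos:note", "skos:definition"}

# RFC 3066 syntax only: 1-8 letters, then optional "-" subtags of 1-8 letters or digits.
LANG_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def find_bad_language_tags(triples):
    """Return (subject, predicate, text) for literals with a missing or malformed tag."""
    bad = []
    for subject, predicate, (text, lang) in triples:
        if predicate not in TEXT_PREDICATES:
            continue
        if lang is None or not LANG_TAG.match(lang):
            bad.append((subject, predicate, text))
    return bad

# Hypothetical sample data.
triples = [
    ("ex:a", "skos:prefLabel", ("Address families", None)),  # no tag
    ("ex:b", "skos:prefLabel", ("Netzwerk", "de")),           # fine
    ("ex:c", "skos:note", ("a note", "en_US")),               # underscore is not allowed
]
bad_literals = find_bad_language_tags(triples)
```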
Description | Some concepts in a thesaurus are labeled in only one language, some in multiple languages. It may be desirable to have each concept labeled in each of the languages that also are used on the other concepts. This is not always possible, but incompleteness of language coverage for some concepts can indicate shortcomings of the vocabulary. |
Example | The MARC List for Countries (http://id.loc.gov/vocabulary/countries.html) defines French labels for some but not all concepts. |
Implementation | Iteration over all concepts in the vocabulary and creation of a global set of language tags appearing in the vocabulary. In a second iteration, each concept having a set of language tags that is not equal to the global language tag set is returned. |
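A minimal Python sketch of the two passes, assuming labels have already been extracted as `(concept, text, lang)` tuples (the function name and the sample data are hypothetical):

```python
from collections import defaultdict

def incomplete_language_coverage(labels):
    """labels: iterable of (concept, text, lang). Returns {concept: missing language tags}."""
    langs_by_concept = defaultdict(set)
    for concept, _text, lang in labels:
        tags = langs_by_concept[concept]  # also registers concepts with untagged labels only
        if lang:
            tags.add(lang)
    # First pass result: every language tag that appears anywhere in the vocabulary.
    all_langs = set().union(*langs_by_concept.values())
    # Second pass: concepts whose own tag set differs from the global set.
    return {c: all_langs - tags for c, tags in langs_by_concept.items() if tags != all_langs}

labels = [
    ("ex:fr", "France", "en"),
    ("ex:fr", "France", "fr"),
    ("ex:de", "Germany", "en"),
]
coverage_gaps = incomplete_language_coverage(labels)
```

Returning the missing tags per concept, rather than only the concept, makes the report directly actionable for a vocabulary editor.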
Description | Checks if all concepts have at least one common language, i.e. they have assigned at least one literal in the same language. |
Example | |
Preliminary ideas on computation |
Description | The SKOS "standard" defines a number of properties useful for documenting the meaning of the concepts in a thesaurus (section 7) also in a human-readable form. Intense use of these properties leads to a well-documented thesaurus which should also improve its quality. |
Example | Library of Congress Thesaurus for Graphic Materials offers a high coverage of documentation properties |
Implementation | Iteration over all concepts in the vocabulary, finding those that use none of skos:note, skos:changeNote, skos:definition, skos:editorialNote, skos:example, skos:historyNote, or skos:scopeNote. |
Description | This is a generalization of a recommendation in the SKOS primer, that “no two concepts have the same preferred lexical label in a given language when they belong to the same concept scheme”. Identical labels could indicate missing disambiguation information and thus lead to problems in autocompletion applications. |
Example | The concept http://purl.org/collections/nl/am/t-8523 defines the skos:prefLabel "streekvervoer" ("local transport"). Another concept, http://purl.org/collections/nl/am/t-6081, defines the skos:altLabel "streekvervoer" and the skos:prefLabel "openbaar vervoer" ("public transport"). |
Implementation | Iteration over all authoritative concepts, collecting their respective labels. In a second pass, the similarity of all possible label pairs is checked by a similarity function. Concept labels with a similarity value at or above a given threshold are considered conflicting and are returned. In the current implementation, the similarity function is string equality with a threshold equal to 1. |
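With string equality as the similarity function, the pairwise comparison reduces to grouping concepts by identical label. The sketch below takes that shortcut, so it is not a general similarity check. Labels are assumed to be `(concept, text, lang)` tuples; the `am:` prefix and the `nl` language tag are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def label_conflicts(labels):
    """Return sorted pairs of distinct concepts that share an identical label."""
    concepts_by_label = defaultdict(set)
    for concept, text, lang in labels:
        concepts_by_label[(text, lang)].add(concept)
    conflicts = set()
    for concepts in concepts_by_label.values():
        # Every pair of concepts carrying the same label counts as one conflict.
        conflicts.update(combinations(sorted(concepts), 2))
    return sorted(conflicts)

labels = [
    ("am:t-8523", "streekvervoer", "nl"),     # prefLabel in the example above
    ("am:t-6081", "streekvervoer", "nl"),     # altLabel in the example above
    ("am:t-6081", "openbaar vervoer", "nl"),
]
conflicts = label_conflicts(labels)
```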
Description | To make the vocabulary more convenient for humans to use, instances of SKOS classes (Concept, ConceptScheme, Collection) should be labeled using e.g., prefLabel, altLabel, rdfs:label, dc:title. |
Example | |
Preliminary ideas on computation |
Description | Values of skos:prefLabel, skos:altLabel, or skos:hiddenLabel contain characters that are not alphanumeric characters or blanks. |
Example | Newline characters that have been left over from automated vocabulary conversion or invalid user input. |
Preliminary ideas on computation | A SPARQL query would be sufficient to find labels containing characters that belong to the Unicode general categories "Zl", "Zp", and "C". |
Description | Labels also need to contain textual information to be useful, thus we find all SKOS labels with length 0 (after removing whitespace). |
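As an alternative to SPARQL, the same category test can be sketched with Python's standard `unicodedata` module. The function name is hypothetical, and "C" is read here as every general category starting with `C` (control, format, and so on).

```python
import unicodedata

def invalid_characters(label):
    """Return the characters of a label that fall into Zl, Zp, or any C* category."""
    flagged = []
    for ch in label:
        category = unicodedata.category(ch)
        if category in ("Zl", "Zp") or category.startswith("C"):
            flagged.append(ch)
    return flagged

# A newline left over from conversion is a control character (category Cc).
leftover = invalid_characters("local\ntransport")
clean = invalid_characters("plain label")
```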
Example | |
Preliminary ideas on computation |
Description | Concepts within the same concept scheme should not have identical skos:notation literals. |
Example | |
Preliminary ideas on computation |
SKOS is based on RDF, which is a graph-based data model. Therefore we can concentrate on the vocabulary's graph-based structure for assessing the quality of SKOS vocabularies and apply graph- and network-analysis techniques.
Description | An orphan concept is a concept without any associative or hierarchical relations. It might have attached literals such as labels, but it is not connected to any other resource and therefore lacks valuable context information. A controlled vocabulary that contains many orphan concepts is less usable for search and retrieval use cases, because, e.g., no hierarchical query expansion can be performed on search terms to find documents with more general content. |
Example | In the press contacts information dataset from the University of Southampton (http://data.southampton.ac.uk/dataset/pressinfo.html), SKOS concepts are defined but aren't linked to other resources using SKOS properties (e.g., http://id.southampton.ac.uk/pressinfo/subject/ActiveSourceSeismology). Similarly, the http://river.styx.org/network ontology defines the concept http://river.styx.org/network#AddressFamily, but doesn't link it using SKOS properties. |
Implementation | Iteration over all concepts in the vocabulary, returning those that have no associated resources via (subproperties of) skos:semanticRelation. |
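A rough Python sketch of this check, assuming triples are plain tuples and that the subproperties of skos:semanticRelation are listed explicitly instead of being inferred (the set below and the sample data are illustrative):

```python
# skos:semanticRelation and its subproperties, written out by hand.
SEMANTIC_RELATIONS = {
    "skos:semanticRelation", "skos:broader", "skos:narrower", "skos:related",
    "skos:broaderTransitive", "skos:narrowerTransitive", "skos:mappingRelation",
    "skos:closeMatch", "skos:exactMatch", "skos:broadMatch",
    "skos:narrowMatch", "skos:relatedMatch",
}

def orphan_concepts(concepts, triples):
    """Return concepts that are neither subject nor object of a semantic relation."""
    connected = set()
    for subject, predicate, obj in triples:
        if predicate in SEMANTIC_RELATIONS:
            connected.update((subject, obj))
    return sorted(set(concepts) - connected)

concepts = ["ex:a", "ex:b", "ex:c"]
triples = [
    ("ex:a", "skos:broader", "ex:b"),
    ("ex:c", "skos:prefLabel", "lonely"),  # a label does not connect the concept
]
orphans = orphan_concepts(concepts, triples)
```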
Description | By checking the connectivity of the graph, it is possible to identify all weakly connected components. These components form "islands" in the vocabulary and might be caused by incomplete data acquisition, "forgotten" test data, outdated terms and the like. |
Example | The dmGeo vocabulary consists of 5 weakly connected components. It was available at http://www.dismarc.org but now seems to be offline. Weakly connected components can also be found in the LVAk thesaurus. |
Implementation | Creation of an undirected graph that includes all non-orphan concepts as nodes and all semantic relations as edges. Tarjan's algorithm then finds and returns all weakly connected components. |
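For illustration, the components of such an undirected graph can also be collected with a plain breadth-first search. The sketch below uses that simpler technique rather than Tarjan's algorithm and assumes the semantic relations are given as `(concept, concept)` pairs. Orphan concepts never appear in an edge, so they are excluded automatically.

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """edges: iterable of (a, b) pairs. Returns a list of node sets, one per component."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)  # direction is ignored: store both ways
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

components = weakly_connected_components([("a", "b"), ("b", "c"), ("d", "e")])
```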
Description | Although perfectly consistent with the SKOS data model, cyclic relations may reveal a logical problem in the thesaurus. Consider the following example: "decision" -> "problem resolution" -> "problem" (-> "decision": here the cycle is closed). The concepts are connected using skos:broader relationships (indicated with "->"). Due to the fact that a thesaurus is in many cases a product of consensus between the contributors (or just the decision of one dedicated thesaurus manager), it will be almost impossible to automatically resolve the cycle (i.e. deleting an edge). |
Example | DBpedia categories: http://dbpedia.org/resource/Category:Wikipedians_in_West_Virginia is its own broader concept. http://dbpedia.org/resource/Category:Republic_of_Macedonia broader http://dbpedia.org/resource/Category:Macedonia broader http://dbpedia.org/resource/Category:Geography_of_the_Republic_of_Macedonia. Furthermore, http://dbpedia.org/resource/Category:Graphics_software broader http://dbpedia.org/resource/Category:Application_software and vice versa. |
Implementation | Construction of a graph having all concepts as nodes and the set of edges being skos:broader relations. |
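Once the graph is built, cycles can be found with a depth-first search. The following is a sketch only: the source does not say which cycle detection is used, the input format (`{concept: set of broader concepts}`) is an assumption, and the recursive form would need to be made iterative for very deep hierarchies.

```python
def find_cycles(broader):
    """Return one cycle (as a list of concepts) per back edge found by a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    colour = {}
    cycles = []

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for parent in sorted(broader.get(node, ())):
            state = colour.get(parent, WHITE)
            if state == GREY:      # back edge: parent is already on the current path
                cycles.append(path[path.index(parent):])
            elif state == WHITE:
                visit(parent, path)
        path.pop()
        colour[node] = BLACK

    for concept in sorted(broader):
        if colour.get(concept, WHITE) == WHITE:
            visit(concept, [])
    return cycles

# The cycle from the description above.
cycles = find_cycles({
    "decision": {"problem resolution"},
    "problem resolution": {"problem"},
    "problem": {"decision"},
})
```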
Description | Two concepts are siblings, but also connected by an associative relation. In that context, the associative relation is not necessary. See ISO_DIS_25964-1, 11.3.2.2 |
Example | The concepts http://eurovoc.europa.eu/2291 and http://eurovoc.europa.eu/4483 in the EuroVoc thesaurus are related associatively (skos:related) and hierarchically. |
Implementation | Identification of all pairs of concepts that have the same broader or narrower concepts, i.e. they are "sibling terms". All siblings that are related by a skos:related property are returned. |
Description | skos:broaderTransitive and skos:narrowerTransitive are, according to the SKOS reference document, "not used to make assertions", so they should not be the only relations hierarchically relating two concepts. |
Example | The NAICS thesaurus contains 2189 concepts that are related directly by skos:broaderTransitive. |
Implementation | Identification of all concept pairs that are related by skos:broaderTransitive or skos:narrowerTransitive properties but not by their skos:broader and skos:narrower subproperties. |
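A possible Python sketch of this comparison, assuming it runs on the asserted (not entailed) triples, since after entailment every skos:broader pair would also carry skos:broaderTransitive. Narrower statements are normalised into the broader direction; the function name and sample data are hypothetical.

```python
def solely_transitively_related(triples):
    """Return (narrower concept, broader concept) pairs asserted only via the transitive properties."""
    direct, transitive = set(), set()
    for subject, predicate, obj in triples:
        if predicate == "skos:broader":
            direct.add((subject, obj))
        elif predicate == "skos:narrower":
            direct.add((obj, subject))        # flip into the broader direction
        elif predicate == "skos:broaderTransitive":
            transitive.add((subject, obj))
        elif predicate == "skos:narrowerTransitive":
            transitive.add((obj, subject))    # flip into the broader direction
    return sorted(transitive - direct)

triples = [
    ("ex:a", "skos:broaderTransitive", "ex:b"),   # no direct counterpart: reported
    ("ex:c", "skos:broaderTransitive", "ex:d"),
    ("ex:d", "skos:narrower", "ex:c"),            # direct counterpart exists
]
solely_transitive = solely_transitively_related(triples)
```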
Description | Reciprocal relations (e.g., broader/narrower, related, hasTopConcept/topConceptOf) should be included in controlled vocabularies, e.g., to achieve better search results using SPARQL in systems without reasoner support. |
Example | |
Implementation | This issue is checked WITHOUT inference of owl:inverseOf properties. We iterate over all triples and check for each property if an inverse property is defined in the SKOS ontology and if the respective statement using this property is included in the vocabulary. If not, the resources associated with this property are returned. |
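The check can be sketched in Python as follows. The `INVERSE_OF` table is written out by hand as an assumption instead of being read from the SKOS ontology, and skos:related is treated as its own reciprocal because it is symmetric.

```python
# Hand-written reciprocal table (assumption: not loaded from the SKOS ontology).
INVERSE_OF = {
    "skos:broader": "skos:narrower",
    "skos:narrower": "skos:broader",
    "skos:broaderTransitive": "skos:narrowerTransitive",
    "skos:narrowerTransitive": "skos:broaderTransitive",
    "skos:hasTopConcept": "skos:topConceptOf",
    "skos:topConceptOf": "skos:hasTopConcept",
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
    "skos:related": "skos:related",   # symmetric property
}

def missing_reciprocals(triples):
    """Return asserted triples whose reciprocal statement is not asserted as well."""
    asserted = set(triples)
    return sorted(
        (s, p, o) for s, p, o in asserted
        if p in INVERSE_OF and (o, INVERSE_OF[p], s) not in asserted
    )

triples = [
    ("ex:a", "skos:broader", "ex:b"),
    ("ex:b", "skos:narrower", "ex:a"),   # reciprocal of the first triple
    ("ex:c", "skos:related", "ex:d"),    # ex:d skos:related ex:c is missing
]
missing = missing_reciprocals(triples)
```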
Description | A vocabulary should provide "entry points" to the data to provide “efficient access” (SKOS primer) and guidance for human users. |
Example | In EuroVoc, the ConceptScheme http://eurovoc.europa.eu/100141 doesn't have a top concept defined. |
Implementation | For every ConceptScheme in the controlled vocabulary, a SPARQL query is issued finding resources that are associated with this ConceptScheme by one of the properties skos:hasTopConcept or skos:topConceptOf. TODO: extend notion of top concepts also by concepts having no broader concept (as suggested in [Abdul]). |
Description | Concepts "internal to the tree" should not be indicated as top concepts, as pointed out in [Allemang2011]. |
Example | In the PXV vocabulary, http://www.peroxisomekb.nl/v1.6/pxv/C000850 is a top concept that has a broader concept. |
Implementation | A SPARQL query finds all top concepts (being defined by one of the properties skos:hasTopConcept or skos:topConceptOf) having associated a broader concept. |
Description | As stated in the SKOS reference document, skos:broader and skos:narrower are not transitive properties. However, they are subproperties of skos:broaderTransitive and skos:narrowerTransitive, which enables inference of a "transitive closure". This, in fact, leaves it up to the user to interpret whether a vocabulary's hierarchical structure is seen as transitive or not. In the former case, this check can be useful. It finds pairs of concepts (A,B) that are directly hierarchically related but for which there also exists a hierarchical path through a concept C that connects A and B. |
Example | Concept A has a broader concept B. If a concept C exists, such that A broader B and B broader C, the hierarchical relation A broader C is considered redundant. |
Implementation | These structures can be found by a single SPARQL query. |
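Outside of SPARQL, the same structure can be detected with a small reachability search. The sketch below is one possible formulation, not the query used by the implementation. It assumes the hierarchy is given as `{concept: set of directly broader concepts}` and reports a direct relation (A, B) whenever B can also be reached from A through another directly broader concept.

```python
def redundant_broader(broader):
    """Return (a, b) pairs where 'a broader b' is asserted and also implied by a longer path."""
    def reachable(start, target, blocked):
        # Depth-first search upwards; 'blocked' keeps the search from going back through a.
        stack, seen = [start], {start, blocked}
        while stack:
            node = stack.pop()
            if node == target:
                return True
            for parent in broader.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

    redundant = []
    for a, parents in broader.items():
        for b in parents:
            # Try every other direct parent c of a as the first step of an alternative path.
            if any(c != b and c != a and reachable(c, b, a) for c in parents):
                redundant.append((a, b))
    return sorted(redundant)

# A broader B, B broader C, and the redundant A broader C.
redundant = redundant_broader({"A": {"B", "C"}, "B": {"C"}})
```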
Description | Concepts related to themselves. |
Example | |
Implementation | These structures can be found by a single SPARQL query. |
When publishing Linked Data, it is important to respect the following "rules" (also see http://www.w3.org/DesignIssues/LinkedData.html):
- data is provided using standard formats (e.g., RDF which is obviously the case for SKOS vocabularies)
- linked resources are dereferenceable and provide further information
- data is linked to and from other resources
The issues introduced in this section can be used to create computable metrics for measuring data linkage.
Description | The usage of its concepts can be an indicator for a vocabulary's quality. Usage can be determined by the number of external resources referencing these concepts. |
Example |
Consider the concept "http://dbpedia.org/resource/Michael_Jackson". It is, for example, referenced by the following resources:
|
Implementation | For each authoritative concept in the vocabulary, a SPARQL query (against, e.g., the Sindice endpoint) is issued that returns all triples in which the concept shows up as the object. An estimate of the number of other vocabularies referencing the concept can be obtained by examining whether the host part of the returned triple subject URIs doesn't match the publishing host of the vocabulary. Concepts for which no such matches can be found are returned. |
Description |
SKOS concepts can define links to other concepts within one and the same vocabulary, to concepts in other vocabularies, or to external resources on the Web. These external links are essential to, for example,
|
Example | The New York Times People Vocabulary is aligned to dbpedia using owl:sameAs properties (e.g., http://data.nytimes.com/64870337666324078863). |
Implementation | For each authoritative concept in the vocabulary, a SPARQL query is issued that returns all IRIs that occur as subject or object in triples where this concept is involved. All IRIs that are HTTP URIs and refer to a non-authoritative resource for the concept are counted. Concepts with a count that equals zero are returned. |
Description | If concepts link to other resources (link targets) on the Web, it is important that these resources are dereferenceable. This check finds link targets that return a response code other than 200 after possible redirections. |
Example | New York Times People Vocabulary: Response 404 when dereferencing http://data.nytimes.com/elements/manual |
Implementation | A SPARQL query is issued that finds all HTTP URIs being part (as subject, predicate, or object) of a triple in the vocabulary. The found URIs are then dereferenced and returned if the HTTP response code (after possible redirections) is other than 200. |
Description | The vocabulary should not invent any new terms within the SKOS namespace or use “deprecated” SKOS elements like those defined in Appendix D of the SKOS reference. |
Example | NSDL Registry Agents Vocabulary uses skos:status, which is not defined in http://www.w3.org/2009/08/skos-reference/skos.rdf. http://vocabularyserver.com/emergencias/?tema=575 uses skos:subjectIndicator |
Implementation | A SPARQL query finds all IRIs that appear in one of the vocabulary's triples in combination with a "deprecated" predicate. "Invented" new terms are found by a SPARQL query that selects all IRIs in the vocabulary's RDF triples that belong to the SKOS namespace but are not defined in the SKOS ontology. All terms found by the two queries are returned. |
Description | URIs should be dereferenceable. C. Bizer, How to Publish Linked Data on the Web: "In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs." |
Example | In the CFR Thesaurus (a thesaurus in the legal domain by Cornell University), a concept is identified by a file:// URI. |
Implementation | A SPARQL query is used to find all IRIs that occur as subject in the vocabulary's RDF triples. If their protocol identifier is other than http or https, the resource is returned. |
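The scheme test itself is simple enough to sketch with Python's standard `urllib.parse`. The sketch assumes the subject IRIs have already been collected into plain triples; the sample IRIs are made up.

```python
from urllib.parse import urlsplit

def non_http_subjects(triples):
    """Return subject IRIs whose scheme is neither http nor https."""
    return sorted({
        subject for subject, _predicate, _obj in triples
        if urlsplit(subject).scheme.lower() not in ("http", "https")
    })

triples = [
    ("http://example.org/concept/a", "skos:prefLabel", "a"),
    ("file:///tmp/thesaurus#c1", "skos:prefLabel", "b"),
    ("urn:isbn:0451450523", "skos:prefLabel", "c"),
]
non_http = non_http_subjects(triples)
```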
This category defines issues that relate to specific design decisions of the SKOS ontology. Some of them are also semi-formally expressed in the SKOS reference documentation.
Description | Covers condition S27 from the SKOS reference document, that has not been defined formally. |
Example | In the AGROVOC thesaurus, the concepts http://aims.fao.org/aos/agrovoc/c_118 and http://aims.fao.org/aos/agrovoc/c_2969 are affected by this issue. |
Implementation | In a first step, all pairs of concepts are found that are associatively connected, using a SPARQL query. In the second step, a graph is created, containing only hierarchically related concepts and the respective relations. For each concept pair from the first step, we check for a path in the graph from step two. If such a path is found, a clash has been identified and the causing concepts are returned. |
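The two steps can be sketched in Python as below. This is an illustration under assumptions: the associative pairs and the hierarchical pairs are passed in as plain tuples instead of being fetched by SPARQL, and narrower statements are assumed to be normalised into `(narrower concept, broader concept)` pairs beforehand.

```python
from collections import defaultdict

def relation_clashes(related_pairs, broader_pairs):
    """Return associatively related pairs that are also connected by a hierarchical path."""
    # Step two of the description: a graph containing hierarchical relations only.
    broader = defaultdict(set)
    for child, parent in broader_pairs:
        broader[child].add(parent)

    def has_path(start, target):
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for parent in broader.get(node, ()):
                if parent == target:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

    # Check each associative pair for a path in either direction.
    return sorted({
        tuple(sorted((a, b))) for a, b in related_pairs
        if has_path(a, b) or has_path(b, a)
    })

clashes = relation_clashes(
    related_pairs=[("x", "z"), ("x", "w")],
    broader_pairs=[("x", "y"), ("y", "z")],
)
```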
Description | Covers condition S46 from the SKOS reference document, that has not been defined formally. |
Example | |
Implementation | Can be solved by issuing a SPARQL query. |
Description | According to the SKOS reference document, "A resource has no more than one value of skos:prefLabel per language tag". |
Example | For the concept http://dbpedia.org/resource/Income_tax, the STW thesaurus mappings define two German prefLabels: "Einkommensteuer" and "Einkommensteuer (Deutschland)". |
Implementation | A SPARQL query is used to find concepts with at least two prefLabels. In a second step, the language tags of these prefLabels are analyzed and an ambiguity is detected if they are equal. |
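A compact Python sketch of the second step, assuming the prefLabels have been fetched as `(resource, text, lang)` tuples. The `dbpedia:` prefix, the English label, and the `de`/`en` tags are illustrative assumptions.

```python
from collections import Counter

def ambiguous_pref_labels(pref_labels):
    """Return (resource, lang) keys that carry more than one distinct prefLabel."""
    # set() drops exact duplicates; a missing tag is normalised to "" so keys stay sortable.
    counts = Counter(
        (resource, lang or "") for resource, _text, lang in set(pref_labels)
    )
    return sorted(key for key, n in counts.items() if n > 1)

pref_labels = [
    ("dbpedia:Income_tax", "Einkommensteuer", "de"),
    ("dbpedia:Income_tax", "Einkommensteuer (Deutschland)", "de"),
    ("dbpedia:Income_tax", "Income tax", "en"),
]
ambiguous = ambiguous_pref_labels(pref_labels)
```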
Description | Covers condition S13 from the SKOS reference document (section 5.4) stating that "skos:prefLabel, skos:altLabel and skos:hiddenLabel are pairwise disjoint properties". |
Example | The concept http://aims.fao.org/aos/agrovoc/c_35337 in AGROVOC has the string literal "tüske" defined as both prefLabel and altLabel. http://www.afp-ifm-thesaurus.net/t-pro/Merkzeichen prefLabel and altLabel identical (marques@fr) http://lod.geospecies.org/kingdoms/Af?format=rdf prefLabel and altLabel identical (Bacteria) http://www.afp-ifm-thesaurus.net/t-pro/Motto prefLabel and hiddenLabel identical (motto@de) |
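Implementation | A SPARQL query collects all labels of all concepts, building an in-memory structure. This structure is then checked for entries that violate disjointness, i.e. identical literals assigned to the same concept by more than one label property. |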
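One way to sketch such an in-memory structure in Python is a dictionary keyed by resource and literal. The tuple layout `(resource, property, text, lang)` and the `t-pro:` prefix are assumptions made for the example.

```python
from collections import defaultdict

def non_disjoint_labels(label_triples):
    """Return ((resource, text, lang), properties) where one literal has several label properties."""
    properties = defaultdict(set)
    for resource, prop, text, lang in label_triples:
        properties[(resource, text, lang or "")].add(prop)
    return sorted(
        (key, sorted(props)) for key, props in properties.items() if len(props) > 1
    )

label_triples = [
    ("t-pro:Merkzeichen", "skos:prefLabel", "marques", "fr"),
    ("t-pro:Merkzeichen", "skos:altLabel", "marques", "fr"),
    ("t-pro:Motto", "skos:prefLabel", "motto", "de"),
]
violations = non_disjoint_labels(label_triples)
```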
Description | According to the SKOS reference documentation, mapping relations (e.g., skos:broadMatch or skos:relatedMatch) should be asserted to concepts being members of different concept schemes. This check finds concepts that are related by a mapping property and are either members of the same concept scheme or members of no concept scheme at all. |
Example | The concept labeled "jaguar" is a member of the concept scheme labeled "animals". Furthermore, the concept "cat" is a member of the same concept scheme, and "jaguar" is related to "cat" by skos:broadMatch. Thus, this relation can be considered a misuse of a mapping relation. |
Implementation |