Slash separated ontologies cause unneeded traversals when used as source #61

Open
Maximvdw opened this issue Apr 24, 2022 · 5 comments

@Maximvdw
Contributor

Maximvdw commented Apr 24, 2022

Issue type:

  • 🐛 Bug

Description:

This issue is about fixing performance problems with slash-separated ontologies. Let's say you are using X subjects from the same ontology, e.g. http://example.com/myontology/A and http://example.com/myontology/B. Currently they are treated as individual datasets and will be traversed individually (as they should). In a normal linked data front-end this would work fine and only fetch these concepts rather than a large dataset that might contain unneeded information.

In some use cases you might be using a lot of concepts from the same ontology, in which case one request to http://example.com/myontology/ would be preferable.

When putting this ontology in sources, I would expect only one request to be made. However, it seems Comunica will still try to fetch the subjects individually, creating a separate request for every subject in http://example.com/myontology/.
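To make the difference concrete, here is a minimal sketch (not Comunica code) of why slash-separated terms lead to one request per term: dereferencing an IRI fetches the IRI without its fragment, so slash IRIs each name their own document. The helper names `documentOf` and `countDocuments` and the hash-based variant of the ontology are assumptions made up for this illustration.

```typescript
// Strip the fragment (the part after '#') to obtain the document that a
// client has to fetch in order to dereference an IRI.
function documentOf(iri: string): string {
  const hashIndex = iri.indexOf('#');
  return hashIndex === -1 ? iri : iri.slice(0, hashIndex);
}

// Count the distinct documents behind a list of term IRIs.
function countDocuments(iris: string[]): number {
  return new Set(iris.map(documentOf)).size;
}

// Slash-separated terms from the issue: every term is its own document.
const slashTerms = [
  'http://example.com/myontology/A',
  'http://example.com/myontology/B',
];

// Hypothetical hash-separated variant: all terms share one document.
const hashTerms = [
  'http://example.com/myontology#A',
  'http://example.com/myontology#B',
];

const slashRequests = countDocuments(slashTerms); // 2 distinct documents
const hashRequests = countDocuments(hashTerms); // 1 distinct document
```

With a slash namespace the number of lookups grows with the number of terms used, which is why a single request to the namespace document would be preferable here.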

I think it is similar to the 'similarity' prioritisation in #51; however, I am not certain that it is exactly the same issue as the one that appears here.

Try it out
https://comunica.github.io/comunica-feature-link-traversal-web-clients/builds/default/
Use the following source (http):
http://qudt.org/vocab/unit/
Enable the proxy (also tested it without proxy):
https://proxy.linkeddatafragments.org/
Make sure it is HTTPS and not the default HTTP

Test query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT ?unitName WHERE {
    ?unit a qudt:Unit ;
          rdfs:label ?unitName .
}

For each concept that is available in the source, it will still create a request to the individual page (which, in the case of this ontology, is a lot of requests).


Environment:

I am using the default config along with "@comunica/query-sparql-link-traversal-solid": "0.0.2-alpha.4.0",

@github-actions

Thanks for reporting!

@Maximvdw changed the title from "Slash separated ontologies cause unneeded traversals" to "Slash separated ontologies cause unneeded traversals when used as source" on Apr 24, 2022
@Maximvdw
Contributor Author

Maximvdw commented Apr 24, 2022

Hmm, I am not sure whether actor-rdf-resolve-hypermedia-links-traverse-prune-shapetrees is intended to solve this issue, for example with a fictional new ShapeTree(undefined, undefined, 'http://qudt.org/unit/{id}')?
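As a rough illustration of the idea in this comment, the sketch below checks whether a link is covered by a URI template, so that a pruning step could skip it. It is not the behaviour of the shapetrees actor and does not use Comunica's ShapeTree class; `templateToRegExp` and `isCoveredBy` are hypothetical helpers, and the template follows the source URL from the issue (http://qudt.org/vocab/unit/) rather than the fictional one in the comment. Only simple `{name}` placeholders that match a single path segment are supported.

```typescript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Turn a simple URI template into a RegExp. Each '{name}' placeholder matches
// exactly one non-empty path segment (an assumption of this sketch).
function templateToRegExp(template: string): RegExp {
  const literalParts = template.split(/\{[^}]+\}/);
  return new RegExp('^' + literalParts.map(escapeRegExp).join('[^/#?]+') + '$');
}

// A link is "covered" when it matches the template, meaning its data is
// assumed to be available in the source that declared the template.
function isCoveredBy(template: string, link: string): boolean {
  return templateToRegExp(template).test(link);
}

const unitTemplate = 'http://qudt.org/vocab/unit/{id}';

// Matches the template, so a pruning step could skip this link.
const coversUnit = isCoveredBy(unitTemplate, 'http://qudt.org/vocab/unit/AMD');

// Lives in a different namespace, so it would still be followed.
const coversQuantityKind = isCoveredBy(
  unitTemplate,
  'http://qudt.org/vocab/quantitykind/Currency',
);
```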

@rubensworks
Member

Thanks for the issue. This is a very interesting case, which I hadn't considered before.

So the problem here is that too many requests are being done for this query. The link traversal algorithm will do lookups for each separate unit document, even though all required information is actually already present in the initial source.
So we need a mechanism to indicate this fact somehow.

Shapetrees may indeed be a possible solution for this (perhaps using some trickery with cardinalities in shapes), but I'm not sure. In any case, the current shapetrees implementation is incomplete, so it definitely can not be used as-is. I'll report here once I've made some progress on the shapetrees implementation, and when I think it might be helpful here.

In the meantime, content policies may also do the trick, as they should be able to indicate specifically which links can be followed. But this is also still very experimental.

@rubensworks
Member

This problem was also mentioned by @jeswr in #84 for the FOAF vocabulary.

@jeswr
Member

jeswr commented Nov 26, 2022

@rubensworks wrote: "even though all required information is actually already present in the initial source. So we need a mechanism to indicate this fact somehow."

One way of doing this is to make use of rdfs:isDefinedBy. In particular, during link traversal, all incoming triples matching the pattern ?s rdfs:isDefinedBy ?o should be stored in a lookup table or in-memory store. Before a link is added to the link queue, we can then first check whether it is in the isDefinedBy lookup table and whether the document that defines it has already been dereferenced; if so, the link can be skipped.

This would indeed solve the QUDT case above, which has terms defined as follows:

<http://qudt.org/vocab/unit/AMD>
  a <http://qudt.org/schema/qudt/CurrencyUnit> ;
  a <http://qudt.org/schema/qudt/Unit> ;
  <http://purl.org/dc/terms/description> "Armenia"^^rdf:HTML ;
  <http://qudt.org/schema/qudt/currencyExponent> 0 ;
  <http://qudt.org/schema/qudt/dbpediaMatch> "http://dbpedia.org/resource/Armenian_dram"^^xsd:anyURI ;
  <http://qudt.org/schema/qudt/hasDimensionVector> <http://qudt.org/vocab/dimensionvector/A0E0L0I0M0H0T0D1> ;
  <http://qudt.org/schema/qudt/hasQuantityKind> <http://qudt.org/vocab/quantitykind/Currency> ;
  <http://qudt.org/schema/qudt/informativeReference> "http://en.wikipedia.org/wiki/Armenian_dram?oldid=492709723"^^xsd:anyURI ;
  rdfs:isDefinedBy <http://qudt.org/2.1/vocab/unit> ;
  rdfs:isDefinedBy <http://qudt.org/vocab/unit> ;
  rdfs:label "Armenian Dram"@en ;
.
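The rdfs:isDefinedBy lookup table described above could be sketched as follows. This is only an illustration under stated assumptions, not Comunica code: the `Triple` shape, the `DefinedByIndex` class and its method names are hypothetical, terms are plain IRI strings, and the dereferenced document URL is assumed to match the rdfs:isDefinedBy object exactly (no trailing-slash or redirect handling).

```typescript
const RDFS_IS_DEFINED_BY = 'http://www.w3.org/2000/01/rdf-schema#isDefinedBy';

// Simplified triple with IRIs as plain strings (assumption of this sketch).
interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

class DefinedByIndex {
  // Maps a term IRI to the documents that claim to define it.
  private readonly definedBy = new Map<string, Set<string>>();
  // Documents that have already been fetched.
  private readonly dereferenced = new Set<string>();

  markDereferenced(documentUrl: string): void {
    this.dereferenced.add(documentUrl);
  }

  // Record every incoming "?s rdfs:isDefinedBy ?o" triple; ignore the rest.
  observe(triple: Triple): void {
    if (triple.predicate !== RDFS_IS_DEFINED_BY) {
      return;
    }
    let documents = this.definedBy.get(triple.subject);
    if (!documents) {
      documents = new Set<string>();
      this.definedBy.set(triple.subject, documents);
    }
    documents.add(triple.object);
  }

  // A link is skipped when it was already fetched itself, or when one of the
  // documents that define it was already fetched.
  shouldEnqueue(link: string): boolean {
    if (this.dereferenced.has(link)) {
      return false;
    }
    const documents = this.definedBy.get(link);
    if (!documents) {
      return true;
    }
    return !Array.from(documents).some(doc => this.dereferenced.has(doc));
  }
}

const index = new DefinedByIndex();
index.markDereferenced('http://qudt.org/vocab/unit');
index.observe({
  subject: 'http://qudt.org/vocab/unit/AMD',
  predicate: RDFS_IS_DEFINED_BY,
  object: 'http://qudt.org/vocab/unit',
});

// Defined by an already fetched document, so it is not queued.
const enqueueAmd = index.shouldEnqueue('http://qudt.org/vocab/unit/AMD');
// No isDefinedBy information known, so it is still queued.
const enqueueCurrency = index.shouldEnqueue('http://qudt.org/vocab/quantitykind/Currency');
```

Note that the check only helps once the isDefinedBy triple has been seen, so links queued before that triple arrives would still be followed unless the queue is filtered again later.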

Note that for this to work properly, the responseURL of every dereferenced link should also be added to the set of already dereferenced documents (though maybe this is the job of the HTTP cache?). Ideally one would also track redirects (e.g. via trackRedirects when using a library with an API like follow-redirects) to further optimise this process.
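A minimal sketch of the bookkeeping suggested in this note, assuming the HTTP layer can report the requested URL, any intermediate redirect URLs, and the final response URL. The `FetchOutcome` shape and the `DereferencedDocuments` class are hypothetical, and the redirect in the example is invented for illustration; it is not taken from a real server.

```typescript
// What the HTTP layer is assumed to report for one dereference operation.
interface FetchOutcome {
  requestedUrl: string;
  // Intermediate URLs visited while following redirects (may be empty).
  redirectUrls: string[];
  // Final URL the response was served from.
  responseUrl: string;
}

class DereferencedDocuments {
  private readonly seen = new Set<string>();

  // Register every URL involved in the fetch as "already dereferenced".
  record(outcome: FetchOutcome): void {
    this.seen.add(outcome.requestedUrl);
    outcome.redirectUrls.forEach(url => this.seen.add(url));
    this.seen.add(outcome.responseUrl);
  }

  has(url: string): boolean {
    return this.seen.has(url);
  }
}

const documents = new DereferencedDocuments();

// Hypothetical fetch: the namespace URL redirects to a Turtle file.
documents.record({
  requestedUrl: 'http://example.com/myontology/',
  redirectUrls: ['http://example.com/myontology'],
  responseUrl: 'http://example.com/myontology.ttl',
});

const knowsRequested = documents.has('http://example.com/myontology/');
const knowsFinal = documents.has('http://example.com/myontology.ttl');
const knowsTerm = documents.has('http://example.com/myontology/A');
```

Recording all aliases of a document means an rdfs:isDefinedBy object can be recognised as already dereferenced even when it differs from the URL that was originally requested.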

cc @pmcb55


3 participants