Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

Closed
k00ni opened this issue Mar 11, 2021 · 13 comments · Fixed by #42
Closed

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

k00ni opened this issue Mar 11, 2021 · 13 comments · Fixed by #42
Assignees
Labels

Comments

@k00ni
Copy link
Collaborator

k00ni commented Mar 11, 2021

@pietercolpaert When parsing turtle file https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest parser returns 0 triples.

ARC2's Turtle parser has no problems and returns all triples, so file should be correct.

I made a test which shows the problem: https://github.com/pietercolpaert/hardf/blob/error/turtle-parser-fails-unknown/test/TriGParserTest.php#L2061-L2072 (branch error/turtle-parser-fails-unknown)

Any idea why?

@k00ni k00ni added the bug label Mar 11, 2021
@k00ni k00ni self-assigned this Mar 11, 2021
@k00ni k00ni changed the title Turtle parser fails for unknown reason Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) Mar 11, 2021
@zozlak
Copy link
Contributor

zozlak commented Feb 15, 2024

Just out of curiosity - what's the subject created by the ARC2 parser for the <> rdf:type mf:Manifest triple?

The problem is the <> subject. By the turtle specification it's a relative IRI. As a relative IRI it should be resolved using the base IRI. The problem is the base IRI is not set (the @prefix : <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . sets the default prefix for prefixed names but not the base IRI for relative IRIs; btw it means parsing works if you add @base <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . to the file).

I guess it is expected to work the way that when a remote RDF dataset is being read, the reader should set the base IRI to the remote document location but as the TriGParser does not know the source is a remote dataset, it's on you to include the 'documentIRI' => 'https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest' in the TriGParser constructor options.

@zozlak
Copy link
Contributor

zozlak commented Feb 15, 2024

Still, we should expect the TriGParser either to throw an error when the result of an IRI resolution is empty (because as you shown it harms further processing) or to have some base IRI default.

Funnily both solutions seems to be standards-conformant. For the definition of the base IRI the RDF concepts redirects us to the RFC3986 which says there are four possible sources for the base URI (and first available should be used):

  • Base URI Embedded in Content which in the RDF context would be the @base/BASE turtle/SPARQL directive in an RDF document
  • Base URI from the Encapsulating Entity but I can not think of any RDF world equivalence for that
  • Base URI from the Retrieval URI which is self-descriptive (and this is a solution I mentioned in the previous comment)
  • Default Base URI - let's just cite the RFC as it nicely explains the trade-off between having and not having it:

    If none of the conditions described above apply, then the base URI is defined by the context of the application. As this definition is necessarily application-dependent, failing to define a base URI by using one of the other methods may result in the same content being interpreted differently by different types of applications.

@zozlak
Copy link
Contributor

zozlak commented Feb 27, 2024

Discussed it with @k00ni on the side and he's in favour of using a default base URI as a fallback. A pull request will follow shortly.

@pietercolpaert
Copy link
Owner

The base URI MUST be configurable in the parser and should be the last URI after redirection of the HTTP fetch that took place. If this file just comes from disk, we should expect an @base somewhere. If there’s no @base, probably the base IRI I believe should be the current file path.

I would not use a default base URI hardcoded in the library

@pietercolpaert
Copy link
Owner

We should probably add this as a test-case too!

@zozlak
Copy link
Contributor

zozlak commented Feb 27, 2024

We never considered fixing it to a single value. The only option considered was the last point of the RFC3986 cited below - a fallback when no better option is available. That being said on the TriGParser level the source is just a string so if there is no documentIRI set, we have no way to use a URL from which the data was downloaded or a file name as we just don't have access to this kind of information.

Anyway after some discussion with @k00ni I convinced him to throw an error in such a case with the message explaining the data can't be parse because the base URI is unknown and that in should be set with the documentIRI parser constructor option.

@zozlak
Copy link
Contributor

zozlak commented Feb 27, 2024

One more solution to consider. The current problem is caused by the non-strict type comparisons of the TriGParser::$readEntity() callback return value with null (in which case an empty string equals null). This can be solved by turning these comparisons into type-strict ones (=== instead of ==). As a result the parser will parse <> <a> "b" . as ['', 'a', '"b"']). What do you think?

I personally think throwing an error explaining the lack of the base URI is still the best solution.

@zozlak
Copy link
Contributor

zozlak commented Feb 27, 2024

In fact the TriGParserTest::testBlankNodes() starts with a test of parsing <> with lacking base IRI as empty subject, predicate, object and graph.

Funnily (or not) it works there but a document like <> <b> <c> . (or the one @k00ni is mentioning in this issue) turns the parser into undefined state.

Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

@k00ni k00ni linked a pull request Feb 28, 2024 that will close this issue
@k00ni k00ni closed this as completed in #42 Feb 28, 2024
k00ni added a commit that referenced this issue Feb 28, 2024
* TriGParser: distinguish empty entities from no-etity being read

See #37
(closes #37)

* TriGParserTest::testBlankNodes() adjusted

* removed the prefixed-only IRIs input line from the first test scenario
  as this does not belong to the testBlankNodes() tests and is tested
  aleady in testIssue37()
* turned the empty prefixed IRIs test scenario into two - first, where
  and error is expected due to unknown document base IRI and second,
  where parsing succeeds thanks to `documentIRI` parser option being set

* Update test/TriGParserTest.php

* Update test/TriGParserTest.php

---------

Co-authored-by: Konrad Abicht <[email protected]>
@k00ni
Copy link
Collaborator Author

k00ni commented Feb 28, 2024

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can.

Funnily (or not) it works there but a document like <> <b> <c>. (or the one @k00ni is mentioning in this issue) turns the parser into undefined state. Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

If our current tests don't reflect the specified behavior, they have to be adapted accordingly. I am not sure what you mean with "turns parser into undefined state". We should aim for a reliable solution which we understand and control. I have no problem if the parser acts weird (at least for a while) in some edge cases which no one uses.

However, I really appreciate you taking the time here @zozlak.

@zozlak
Copy link
Contributor

zozlak commented Feb 28, 2024

I am not sure what you mean with "turns parser into undefined state"

The TriGParser is implemented as a state machine but in a little funny way (which I assume just follows the original JS library implementation) - the current state is hold as a property storing the state handling callable and updated with this callable return value. Something like:

$curState = $this->parseTopLevel;
foreach ($lexems as $i) {
  $curState = $curState($i);
}

Our issue was for an empty relative IRI with no document base IRI the callable handling an entity parsing was returning NULL, resulting in $curState being set to NULL, resulting in no further processing being done (as you surely noticed in practice it's a little more subtle as $curState = NULL; $curState($lexem); would throw an PHP error but I hope you get the general idea). When I'm telling about "turning parser into undefined state", I'm talking about ending up with $curState = NULL ("escaping" ;-) the parsing state machine).

@zozlak
Copy link
Contributor

zozlak commented Feb 28, 2024

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can

I agree with this as a general statement but I don't see reviewing the hardf code base for that in the predictable future.

@zozlak
Copy link
Contributor

zozlak commented Jun 26, 2024

Could someone with enough rights (@pietercolpaert or @k00ni I think) make a release including the #42 merge, please?

@k00ni
Copy link
Collaborator Author

k00ni commented Jun 26, 2024

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants