Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

k00ni · 2021-03-11T18:02:59Z

@pietercolpaert When parsing turtle file https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest parser returns 0 triples.

ARC2's Turtle parser has no problems and returns all triples, so file should be correct.

I made a test which shows the problem: https://github.com/pietercolpaert/hardf/blob/error/turtle-parser-fails-unknown/test/TriGParserTest.php#L2061-L2072 (branch error/turtle-parser-fails-unknown)

Any idea why?

zozlak · 2024-02-15T08:52:10Z

Just out of curiosity - what's the subject created by the ARC2 parser for the <> rdf:type mf:Manifest triple?

The problem is the <> subject. By the turtle specification it's a relative IRI. As a relative IRI it should be resolved using the base IRI. The problem is the base IRI is not set (the @prefix : <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . sets the default prefix for prefixed names but not the base IRI for relative IRIs; btw it means parsing works if you add @base <http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest#> . to the file).

I guess it is expected to work the way that when a remote RDF dataset is being read, the reader should set the base IRI to the remote document location but as the TriGParser does not know the source is a remote dataset, it's on you to include the 'documentIRI' => 'https://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-update-1/manifest' in the TriGParser constructor options.

zozlak · 2024-02-15T09:07:34Z

Still, we should expect the TriGParser either to throw an error when the result of an IRI resolution is empty (because as you shown it harms further processing) or to have some base IRI default.

Funnily both solutions seems to be standards-conformant. For the definition of the base IRI the RDF concepts redirects us to the RFC3986 which says there are four possible sources for the base URI (and first available should be used):

Base URI Embedded in Content which in the RDF context would be the @base/BASE turtle/SPARQL directive in an RDF document
Base URI from the Encapsulating Entity but I can not think of any RDF world equivalence for that
Base URI from the Retrieval URI which is self-descriptive (and this is a solution I mentioned in the previous comment)
Default Base URI - let's just cite the RFC as it nicely explains the trade-off between having and not having it:

If none of the conditions described above apply, then the base URI is defined by the context of the application. As this definition is necessarily application-dependent, failing to define a base URI by using one of the other methods may result in the same content being interpreted differently by different types of applications.

zozlak · 2024-02-27T14:49:44Z

Discussed it with @k00ni on the side and he's in favour of using a default base URI as a fallback. A pull request will follow shortly.

pietercolpaert · 2024-02-27T14:56:10Z

The base URI MUST be configurable in the parser and should be the last URI after redirection of the HTTP fetch that took place. If this file just comes from disk, we should expect an @base somewhere. If there’s no @base, probably the base IRI I believe should be the current file path.

I would not use a default base URI hardcoded in the library

pietercolpaert · 2024-02-27T15:12:09Z

We should probably add this as a test-case too!

zozlak · 2024-02-27T15:45:16Z

We never considered fixing it to a single value. The only option considered was the last point of the RFC3986 cited below - a fallback when no better option is available. That being said on the TriGParser level the source is just a string so if there is no documentIRI set, we have no way to use a URL from which the data was downloaded or a file name as we just don't have access to this kind of information.

Anyway after some discussion with @k00ni I convinced him to throw an error in such a case with the message explaining the data can't be parse because the base URI is unknown and that in should be set with the documentIRI parser constructor option.

zozlak · 2024-02-27T16:04:58Z

One more solution to consider. The current problem is caused by the non-strict type comparisons of the TriGParser::$readEntity() callback return value with null (in which case an empty string equals null). This can be solved by turning these comparisons into type-strict ones (=== instead of ==). As a result the parser will parse <> <a> "b" . as ['', 'a', '"b"']). What do you think?

I personally think throwing an error explaining the lack of the base URI is still the best solution.

See pietercolpaert#37 (closes pietercolpaert#37)

zozlak · 2024-02-27T17:07:07Z

In fact the TriGParserTest::testBlankNodes() starts with a test of parsing <> with lacking base IRI as empty subject, predicate, object and graph.

Funnily (or not) it works there but a document like <> <b> <c> . (or the one @k00ni is mentioning in this issue) turns the parser into undefined state.

Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

* TriGParser: distinguish empty entities from no-etity being read See #37 (closes #37) * TriGParserTest::testBlankNodes() adjusted * removed the prefixed-only IRIs input line from the first test scenario as this does not belong to the testBlankNodes() tests and is tested aleady in testIssue37() * turned the empty prefixed IRIs test scenario into two - first, where and error is expected due to unknown document base IRI and second, where parsing succeeds thanks to `documentIRI` parser option being set * Update test/TriGParserTest.php * Update test/TriGParserTest.php --------- Co-authored-by: Konrad Abicht <[email protected]>

k00ni · 2024-02-28T12:06:23Z

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can.

Funnily (or not) it works there but a document like <> <b> <c>. (or the one @k00ni is mentioning in this issue) turns the parser into undefined state. Should I investigate it or can we assume the test should be adjusted to expect the error to be thrown there? (btw <> are relative IRIs so I'm not sure why they are tested in the testBlankNodes())

If our current tests don't reflect the specified behavior, they have to be adapted accordingly. I am not sure what you mean with "turns parser into undefined state". We should aim for a reliable solution which we understand and control. I have no problem if the parser acts weird (at least for a while) in some edge cases which no one uses.

However, I really appreciate you taking the time here @zozlak.

zozlak · 2024-02-28T12:22:06Z

I am not sure what you mean with "turns parser into undefined state"

The TriGParser is implemented as a state machine but in a little funny way (which I assume just follows the original JS library implementation) - the current state is hold as a property storing the state handling callable and updated with this callable return value. Something like:

$curState = $this->parseTopLevel;
foreach ($lexems as $i) {
  $curState = $curState($i);
}

Our issue was for an empty relative IRI with no document base IRI the callable handling an entity parsing was returning NULL, resulting in $curState being set to NULL, resulting in no further processing being done (as you surely noticed in practice it's a little more subtle as $curState = NULL; $curState($lexem); would throw an PHP error but I hope you get the general idea). When I'm telling about "turning parser into undefined state", I'm talking about ending up with $curState = NULL ("escaping" ;-) the parsing state machine).

zozlak · 2024-02-28T12:44:52Z

Because of various PHP-magic, I would be in favor of using strict-comparisons (===) wherever we can

I agree with this as a general statement but I don't see reviewing the hardf code base for that in the predictable future.

zozlak · 2024-06-26T07:33:52Z

Could someone with enough rights (@pietercolpaert or @k00ni I think) make a release including the #42 merge, please?

k00ni · 2024-06-26T07:51:15Z

Done.

k00ni added the bug label Mar 11, 2021

k00ni self-assigned this Mar 11, 2021

k00ni changed the title ~~Turtle parser fails for unknown reason~~ Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) Mar 11, 2021

zozlak mentioned this issue Feb 27, 2024

How to parse N3 files? sweetrdf/quickRdfIo#5

Closed

zozlak added a commit to zozlak/hardf that referenced this issue Feb 27, 2024

TriGParser: distinguish empty entities from no-etity being read

2bbfaf3

See pietercolpaert#37 (closes pietercolpaert#37)

zozlak mentioned this issue Feb 27, 2024

TriGParser: distinguish empty entities from no-entity being read #42

Merged

k00ni linked a pull request Feb 28, 2024 that will close this issue

TriGParser: distinguish empty entities from no-entity being read #42

Merged

k00ni closed this as completed in #42 Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

k00ni commented Mar 11, 2021 •

edited

Loading

zozlak commented Feb 15, 2024 •

edited

Loading

zozlak commented Feb 15, 2024

zozlak commented Feb 27, 2024

pietercolpaert commented Feb 27, 2024

pietercolpaert commented Feb 27, 2024

zozlak commented Feb 27, 2024

zozlak commented Feb 27, 2024

zozlak commented Feb 27, 2024

k00ni commented Feb 28, 2024 •

edited

Loading

zozlak commented Feb 28, 2024

zozlak commented Feb 28, 2024

zozlak commented Jun 26, 2024

k00ni commented Jun 26, 2024

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

Turtle parser fails for unknown reason (SPARQL 1.1 Syntax Update 1 manifest.ttl) #37

Comments

k00ni commented Mar 11, 2021 • edited Loading

zozlak commented Feb 15, 2024 • edited Loading

zozlak commented Feb 15, 2024

zozlak commented Feb 27, 2024

pietercolpaert commented Feb 27, 2024

pietercolpaert commented Feb 27, 2024

zozlak commented Feb 27, 2024

zozlak commented Feb 27, 2024

zozlak commented Feb 27, 2024

k00ni commented Feb 28, 2024 • edited Loading

zozlak commented Feb 28, 2024

zozlak commented Feb 28, 2024

zozlak commented Jun 26, 2024

k00ni commented Jun 26, 2024

k00ni commented Mar 11, 2021 •

edited

Loading

zozlak commented Feb 15, 2024 •

edited

Loading

k00ni commented Feb 28, 2024 •

edited

Loading