-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VALUES Sparql interpolation with comments fails #118
Comments
Rick commented elsewhere:
I have found the cause of this issue. It's caused by some slightly naive logic in Sesame. Nevertheless, the easy solution is to strip comments in Grafter, as Rick has suggested, rather than alter Sesame. I have a patch almost ready and am currently writing tests to provide confidence and cover edge cases. |
Nice one @scottlowe, Is the sesame issue fixed in RDF4j? Because if it is it might be worth postponing until we can use an RDF4j grafter build. If it's not fixed in RDF4j, we should file a bug report. There are some edge cases to consider with Finally a complicating case is multiline strings containing where
|
My test case looks like this, with hashes in URIs and it works: #This is the first comment line; it should be removed
# and this second line with a space after the hash should be removed
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> # <- uri with hash should be left alone, but this comment removed
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
# This should also be removed
GRAPH ?g { #and this #dsf/># also
?s ?p ?o .
}
} The result of applying the comment-stripper was: PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o .
}
} I was quite pleased with the positive results, given the relatively simple regex: (\s+#\s*[^\n]+)|(^#\s*[^\n]+) However
I appreciate your thoroughness @RickMoynihan, especially given that I didn't consider multiline strings, but that might throw a spanner in the works. My initial impression is that leaving multiline strings containing hashes intact isn't feasible with the current, simple regex/replace approach. I'll have a quick look into it, but covering this case may require parsing and tokenising (?), or else a thoroughly nightmarish and potentially unpredictable regex. Could this be expanding the scope further than is practicable without doing a lot more work? Some options would be to:
RE: option 4: I will compose a follow-up comment with details of the Sesame issue. I didn't post this yesterday because it was late and too much to write at the time - wasn't intending to seem vague. Please stand by!
Don't know! Investigating ... |
Some background to the Sesame issue. Although we believe the comments sometimes cause issues elsewhere within a SPARQL file, the following information specifically relates to a problem seen with SPARQL files that include a SELECT with multiline comments at the start of the file, e.g. # First line of comments
# Second line of comments
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
?s ?p ?o .
} The error message resulting from this query was particularly odd, given that one would expect some sort of SPARQL parsing or syntax exception:
Why would there be an HTTP error concerning the Accept mime format, given that nothing else had changed about the query, except for a couple of comment lines? In Grafter, we call prepareQuery here: (defn prepare-query
"Low level function to prepare (parse, but not process) a sesame RDF
query. Takes a repository a query string and an optional sesame
Dataset to act as a query restriction.
Prepared queries still need to be evaluated with evaluate."
([repo sparql-string] (prepare-query repo sparql-string nil))
([repo sparql-string restriction]
(let [conn (->connection repo)]
(let [pq (.prepareQuery conn
QueryLanguage/SPARQL
sparql-string)]
(when restriction (.setDataset pq restriction))
pq)))) For But how does Sesame determine the correct query type object to prepare and instantiate? First, inside Unfortunately, this means that a comment line will be left at the top of the file, if there was more than one line of comments. public static String removeSPARQLQueryProlog(String queryString) {
String normalizedQuery = queryString;
// strip all prefix declarations
Pattern pattern = Pattern.compile("prefix[^:]+:\\s*<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(queryString);
int startIndexCorrection = 0;
while (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end() - startIndexCorrection,
normalizedQuery.length());
startIndexCorrection += (matcher.end() - startIndexCorrection);
}
normalizedQuery = normalizedQuery.trim();
// strip leading comment before base (if present)
pattern = Pattern.compile("^#[^\n]+");
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
normalizedQuery = normalizedQuery.trim();
}
// strip base declaration (if present)
pattern = Pattern.compile("^base\\s+<[^>]*>\\s*", Pattern.CASE_INSENSITIVE);
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
}
// strip leading comment after base (if present)
pattern = Pattern.compile("^#[^\n]+");
matcher = pattern.matcher(normalizedQuery);
if (matcher.find()) {
normalizedQuery = normalizedQuery.substring(matcher.end(), normalizedQuery.length());
}
return normalizedQuery.trim();
}
} What happens next is that Sesame simply looks to see if the remaining SPARQL string starts with a Because we still have a comment line left at the beginning of our SPARQL string, Sesame can't determine the query type and defaults to a public Query prepareQuery(QueryLanguage ql, String query, String base)
throws RepositoryException, MalformedQueryException
{
if (SPARQL.equals(ql)) {
String strippedQuery = QueryParserUtil.removeSPARQLQueryProlog(query).toUpperCase();
if (strippedQuery.startsWith("SELECT")) {
return prepareTupleQuery(ql, query, base);
}
else if (strippedQuery.startsWith("ASK")) {
return prepareBooleanQuery(ql, query, base);
}
else {
return prepareGraphQuery(ql, query, base);
}
}
throw new UnsupportedOperationException("Unsupported query language " + ql);
} |
Thanks @scottlowe. I've seen comments at the top of files cause problems, but I think I've also seen single line comments in the body of the query cause problems too (when there's a |
@RicSwirrl Yes, that's my experience too... comments occasionally cause issues elsewhere, but by stripping them all, we will have that secondary issue covered as well, as you've righty surmised. |
Ok good explanation @scottlowe, and makes perfect sense about what's happening. If you're looking at testing this with RDF4j you can check by using this branch/ns https://github.com/Swirrl/grafter/blob/0.12.x-SNAPSHOT/src/grafter/rdf4j/sparql.clj That branch should contain the most recent RDF4j with some grafter changes to support it... Though IIRC the test suite is a broken currently on that branch, the code itself should work so you should be able to verify if the bug is still present easy enough. |
@RickMoynihan Will do. I'll look at that 0.12.x version of Grafter. I don't need the tests; I can run some code in the REPL. |
I've looked at the latest RDF4j code. It looks to be better structured overall and is architecturally different to Sesame, however, when you get into the details the Sesame logic relating to our issue persists: The way the prolog is removed is substantially different though, via a lexer rather than regexes: Whilst the effect of leaving a comment hanging is slightly different in this case, with the first comment line truncated differently to the Sesame version, the resulting bug is effectively the same though: ed
# and this second line with a space after the hash
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foi: <http://publishmydata.com/def/ontology/foi/>
SELECT * WHERE {
GRAPH ?g {
?s ?p ?o .
}
} Regardless of how the code looks, I've actually run this RDF4j code from within a Grafter REPL and it fails with the same error as expected, so it's still "broken". 🤔 Just wondering if there is a W3C rule somewhere saying that you can only have a single comment on the first line, before prefixes. Haven't found any evidence supporting that though, so I'll assume that we're right at this time. I will submit a comment-stripping patch for consideration, shortly. |
Some sparql query files file to interpolate properly. This query
failed with this error:
But when I got rid of the leading comment it worked OK.
Note: some other queries seem ok with comments. Maybe it's the VALUES clause in combination with a comment that is the problem(?).
The text was updated successfully, but these errors were encountered: