Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex in replace_query_select is buggy #122

Closed
lu-pl opened this issue Oct 28, 2024 · 3 comments · Fixed by #123
Closed

Regex in replace_query_select is buggy #122

lu-pl opened this issue Oct 28, 2024 · 3 comments · Fixed by #123
Assignees
Labels
bug Something isn't working

Comments

@lu-pl
Copy link
Contributor

lu-pl commented Oct 28, 2024

The regex used in rdfproxy.utils.sparql_utils.replace_query_select_clause is completely inappropriate for the task at hand. E.g. it is currently not possible for rdfproxy to deal with query strings containing newlines like

query = """
select ?parent \n ?child ?name
where {
    values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
    }
}
"""

In that case, the current regex select\s.* matches until \n and creates

select (count(*) as ?cnt) ?child ?name
where {
    values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
    }
}

which is invalid SPARQL and leads to the org.openrdf.query.MalformedQueryException: Bad aggregate [...] Java when targeting Blazegraph/Wikidata (or any other Jena-based triplestore).

Solution proposal

A much more apt regex for selecting the entire SELECT clause would be (select\s+[\s\S]*?)(?=\s+where).
This non-greedily matches the SELECT clause with all variable references and performs a lookahead for WHERE.

@lu-pl lu-pl self-assigned this Oct 28, 2024
@lu-pl lu-pl added the bug Something isn't working label Oct 28, 2024
@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 29, 2024

For the above mentioned Blazegraph exception see the SPARQL Specification 'Aggregate Projection Restrictions'

@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 29, 2024

Note: It is currently not possible to recognize the above query as invalid in a query sanity check using RDFLib, see #2960

@lu-pl
Copy link
Contributor Author

lu-pl commented Nov 6, 2024

Note: A better version of the pattern would be \A(select\s+[\s\S]*?)(?=\s+where) with an absolute anchor.

Since this is used with re.sub which substitutes the "leftmost non-overlapping occurrences of [the] pattern", this doesn't really matter though.

lu-pl added a commit that referenced this issue Nov 6, 2024
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
lu-pl added a commit that referenced this issue Nov 6, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
lu-pl added a commit that referenced this issue Nov 7, 2024
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
lu-pl added a commit that referenced this issue Nov 7, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
lu-pl added a commit that referenced this issue Nov 8, 2024
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
lu-pl added a commit that referenced this issue Nov 8, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
lu-pl added a commit that referenced this issue Nov 8, 2024
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
lu-pl added a commit that referenced this issue Nov 8, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
lu-pl added a commit that referenced this issue Nov 8, 2024
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
lu-pl added a commit that referenced this issue Nov 8, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
@lu-pl lu-pl closed this as completed in #123 Nov 8, 2024
@lu-pl lu-pl closed this as completed in cb15e14 Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant