
Source filtering use cases #1

Closed · rubensworks opened this issue Feb 5, 2020 · 6 comments

@rubensworks (Member)

As discussed in rdfjs/stream-spec#16, this issue serves as a place to collect use cases for why additional filtering capabilities are needed in the Source interface.

@jacoscaz (Contributor) commented Feb 6, 2020

In our case, we'd like additional filtering capabilities that stores can optimize for at the persistence level, reducing the amount of in-memory filtering we would otherwise have to do to satisfy a given query.

This would bring RDF/JS a little closer to the most common specs dealing with data management at the persistence level (SQL, SPARQL, ...). Practical use cases are near endless, and I have a hard time thinking of one that would not benefit from such a feature. In our specific case, we often work with large numbers of IoT devices, and a simple query such as "give me all sensors whose latest datapoint we have received within the last 24 hours" often results in having to filter out thousands of records in memory. These effects compound when filtering queries depend on other filtering queries.
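As a rough sketch of what that in-memory filtering looks like with the current interface (ns.ex.lastSeen, ns.xsd.dateTime and the surrounding setup are hypothetical, invented for illustration):

const oneDayAgo = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString()
const cutoff = rdf.literal(oneDayAgo, ns.xsd.dateTime) // hypothetical xsd:dateTime literal

// match() can only select on the predicate, so every datapoint quad is
// materialized and compared in JS, mostly just to be discarded.
const recent = []
const stream = source.match(null, ns.ex.lastSeen, null)
stream.on('data', quad => {
  // ISO 8601 strings compare correctly as plain strings
  if (quad.object.value >= cutoff.value) recent.push(quad.subject)
})
stream.on('end', () => {
  // recent now holds the sensors seen within the last 24 hours
})

A filter that the store could evaluate against its own indexes would avoid materializing all of those quads in the first place.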

@rubensworks (Member, Author)

Our use case concerns the optimization of query engines by pushing filters in the query plan down to the storage level. As such, it matches the use case of @jacoscaz very well.
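For instance (a hypothetical sketch: matchFilter and filter.gt anticipate the proposal sketched further below, and ns.ex.temperature and ns.xsd.decimal are invented for illustration), a SPARQL FILTER over a single triple pattern could be handed to the source instead of being evaluated by the engine:

// SPARQL: SELECT ?s WHERE { ?s ex:temperature ?t . FILTER(?t > 20) }

// Status quo: the engine fetches all bindings and applies the FILTER itself.
const all = source.match(null, ns.ex.temperature, null)

// With pushdown: the store evaluates the range against its own indexes.
const filtered = source.matchFilter(
  null,
  ns.ex.temperature,
  filter.gt(rdf.literal('20', ns.xsd.decimal))
)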

@bergos (Member) commented Feb 23, 2020

The current interface allows accessing time series only if the interval is known and the timestamps are aligned. Then it's possible to do something like this:

// Pseudocode: match() returns a Stream; the array access stands in for
// reading the first quad off that stream.
const subject = source.match(null, null, rdf.literal('2020-01-01T01:00'))[0].subject
const observation = source.match(subject)

I would expect a filter interface to be an evolution of the match method that allows finding items in a range without a query engine. Different people may have different opinions on what counts as a query engine. In the RDF context, I would use the term for a piece of software that combines the results of different triple patterns. Using an index to solve a single triple pattern is not a query engine.

// Pseudocode, as above: select the subject whose timestamp falls in the
// (00:55, 01:00] range, then fetch the full observation.
const subject = source.matchFilter(null, null, filter.and(
  filter.gt(rdf.literal('2020-01-01T00:55:00.000')),
  filter.lte(rdf.literal('2020-01-01T01:00:00.000'))
))[0].subject
const observation = source.match(subject)

@bergos (Member) commented Feb 23, 2020

It should be possible to define custom filters. Below is an example of a text search filter.

source.matchFilter(null, ns.rdfs.label, filterThisText)

Where filterThisText could be created by a factory like this:

function textSearchFilter (text) {
  // case-insensitive substring match on the term's lexical value
  const test = term => {
    return term.value.toLowerCase().includes(text.toLowerCase())
  }

  return {
    termType: 'Filter',
    type: 'CUSTOM_TEXT_SEARCH',
    args: text,
    test
  }
}

const filterThisText = textSearchFilter('this text') // example search string
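One possible benefit of carrying a test function alongside the filter descriptor (an assumption on my part, not something defined anywhere yet): a store without native support for CUSTOM_TEXT_SEARCH could still honor matchFilter by falling back to in-memory evaluation.

// Hypothetical fallback: evaluate any custom filter in memory when the
// store has no native support for its type.
function matchFilterFallback (source, subject, predicate, objectFilter) {
  const matches = []
  const stream = source.match(subject, predicate, null)
  stream.on('data', quad => {
    if (objectFilter.test(quad.object)) matches.push(quad)
  })
  return new Promise((resolve, reject) => {
    stream.on('end', () => resolve(matches))
    stream.on('error', reject)
  })
}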

rubensworks transferred this issue from rdfjs/stream-spec on Aug 18, 2020
@jacoscaz (Contributor) commented Aug 18, 2020

Following from rdfjs/data-model-spec#167 (comment) and rdfjs/data-model-spec#167 (comment), I'd like to address the following concern from @rubensworks:

A pipeline-based architecture is interesting, I hadn't thought of that before. I'm just wondering if it's expressive enough for all types of queries. I would suspect recursive filter definitions may be a bit more expressive, which is what SPARQL algebra does.

I think you are definitely right - recursive definitions are more expressive. However, they also seem to be significantly harder to deal with from a development perspective, and for quadstore I've intentionally opted for a compromise between the two that allows me to keep the optimization part of the codebase relatively straightforward. In particular, I have found that re-ordering queries by their approximate counts (something I am still working on) becomes a mess really quickly with the recursive approach. Granted, this could be due to a deficiency on my side rather than an objective difference in handling complexity but, be that as it may, I find myself having a much easier time with the pipeline approach.

I am not at all opposed to recursiveness, though, and I will get a chance to explore it further once I switch from sparqljs to sparqlalgebrajs in version 8.
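To make the contrast concrete, here are the two shapes side by side (structures invented for illustration; they match neither quadstore's internals nor the actual SPARQL algebra):

// Pipeline: a flat list of stages. Re-ordering by approximate counts
// is just sorting the list.
const pipeline = [
  { type: 'pattern', subject: null, predicate: ns.ex.lastSeen, object: null },
  { type: 'gt', arg: rdf.literal('2020-01-01T00:00:00.000') },
  { type: 'lte', arg: rdf.literal('2020-01-02T00:00:00.000') }
]

// Recursive: a nested expression tree. Arbitrary nesting of and/or/not
// makes it more expressive, but re-ordering means rewriting the tree.
const expression = {
  type: 'and',
  args: [
    { type: 'gt', arg: rdf.literal('2020-01-01T00:00:00.000') },
    { type: 'lte', arg: rdf.literal('2020-01-02T00:00:00.000') }
  ]
}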

@rubensworks (Member, Author)

Done in #4.
