Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FilterableSource] Add order parameter? #8

Closed
rubensworks opened this issue Oct 29, 2021 · 8 comments
Closed

[FilterableSource] Add order parameter? #8

rubensworks opened this issue Oct 29, 2021 · 8 comments

Comments

@rubensworks
Copy link
Member

An important aspect during query execution is the ability to exploit sort orders of result streams, as this can significantly improve performance during join processing.

To enable query engines to ask for certain orders, we may consider adding an additional parameter to the FilterableSource interface, which could look something like this:

interface FilterableSource {
  FilterResult matchExpression (
    optional Term? subject,
    optional Term? predicate,
    optional Term? obj,
    optional Term? graph,
    optional Expression? expression,
    optional { attribute QuadOrder? order } options,
  );
};
type QuadOrder = { term: string; order: 'asc' | 'desc' }[];

Furthermore, to help query engines out during query optimization, we could also add all available orders to the result metadata:

interface QueryResultMetadata {
  attribute QueryResultMetadataCount? count;
  attribute QuadOrder? order;
  attribute QuadOrder[]? alternativeOrders;
};

\cc @jacoscaz

@jacoscaz
Copy link
Contributor

I'm very much +1 on this. A thought: given that an engine might want to get available orderings before asking for a specific one, wouldn't the optional order param be more appropriate for the FilterResults#quads() method?

That would allow for something like the following:

const results = source.matchExpression(/* ... */);
const { availableOrders } = await results.metadata();

/* check available orders */

const stream = results.quad({ order: someSpecificOrder });

@rubensworks
Copy link
Member Author

A thought: given that an engine might want to get available orderings before asking for a specific one, wouldn't the optional order param be more appropriate for the FilterResults#quads() method?

In general, that should work indeed.

However, I'm just wondering if there are cases where metadata would be dependent on the quads order. E.g., to indicate some cost, as certain orders may be more expensive to obtain (due to sorting or more expensive indexes).
In such cases, a new call to matchExpression might be beneficial.

@jacoscaz
Copy link
Contributor

However, I'm just wondering if there are cases where metadata would be dependent on the quads order.

Good point. In quadstore's case, for example, orderings that do not match any of the configured indexes require in-memory sorting. The spec should provide the means for quadstore to relay this information to engines downstream.

However, I think there's a way to reconcile not having to call matchExpression() repeatedly for the same expression with passing of cost-related information between source and engine and index selection:

interface QuadOrder {
  cost: any; // additional cost, type TBD
  terms: { term: string, direction: 'asc' | 'desc' }[];
}

interface QueryResultMetadata {
  /* ... */
  cost: any; // basic cost, type TBD
  availableOrders?: QuadOrder;
}

I think decoupling the basic query cost (query setup, whether quads are filtered using an index or in-memory, ...) from the additional cost brought by sorting would make for a more expressive representation.

@rubensworks
Copy link
Member Author

Good idea, we can take that approach indeed!
This would even allow us to keep the matchExpression interface unchanged.


Regarding this cost metadata, is assume this is something we want to include in the spec already?

In Comunica, we use the following fields to indicate costs:

    /**
     * An estimation of how many iterations over items are executed.
     * This is used to determine the CPU cost.
     */
    iterations: number;
    /**
     * An estimation of how many items are stored in memory.
     * This is used to determine the memory cost.
     */
    persistedItems: number;
    /**
     * An estimation of how many items block the stream.
     * This is used to determine the time the stream is not progressing anymore.
     */
    blockingItems: number;
    /**
     * An estimation of the time to request items from sources.
     * This estimation can be based on the `cardinality`, `pageSize`, and `requestTime` metadata entries.
     * This is used to determine the I/O cost.
     */
    requestTime: number;

All of these may also apply here, as they can help explain index access and sorting costs.

@ericprud
Copy link

ericprud commented Nov 2, 2021

Is there some intention that the ordering be combined with slicing to provide quasi-stable paging?

ORDER BY ?x DESC(?y)
OFFSET 150 LIMIT 50

@rubensworks
Copy link
Member Author

Is there some intention that the ordering be combined with slicing to provide quasi-stable paging?

Ah yes, good point.

Stores like HDT (perhaps quadstore as well?) can do very efficient offsets (and limits), so it would definitely make sense to also include this next to the sort order.

So we could have something like results.quad({ order: someSpecificOrder, offset: 150, limit: 50 });

@jacoscaz
Copy link
Contributor

jacoscaz commented Nov 2, 2021

I'm also very +1 on supporting offset and limit. However, these might be something to account for in the computation of costs (e.g. iterations would likely be influenced by limit). Perhaps these would be better off in the options argument to matchExpression?

@jacoscaz
Copy link
Contributor

We've recently merged #7, which includes the start and length parameters in the Filterable interface. I think we can close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants