Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch) #300

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

mjosephidou
Copy link

Summary:
In cases where we do grouping and ask for {{group.limit=1}} only it is possible to skip the second grouping step. In our test datasets it improved speed by around 40%.

Essentially, in the first grouping step each shard returns the top K groups based on the highest scoring document in each group. The top K groups from each shard are merged in the federator and in the second step we ask all the shards to return the top documents from each of the top ranking groups.

If we only want to return the highest scoring document per group we can return the top document id in the first step, merge results in the federator to retain the top K groups and then skip the second grouping step entirely.

boolean byRelevanceOnly = false;
SortField[] sortFields = groupingSpec.getGroupSort().getSort();

if(sortFields != null && sortFields.length == 1 && sortFields[0] != null && sortFields[0].getComparator() instanceof FieldComparator.RelevanceComparator) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortField => getComparator(final int numHits, final int sortPos) .. differs in signature of the api..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the PR went out of sync, we will update it asap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the PR went out of sync ...

If you'd like bloomberg#228 pull request to merge into the https://github.com/bloomberg/lucene-solr/tree/SOLR-11831 branch (the branch behind this pull request here) has an interim 'make it compile (somehow)' change (also includes 'logger' to 'LOG' renames in this method).

}

// TODO: At the moment the optimization does not support reranking
if(isReranking) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will ltr not benefit from this optimization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! if this PR gets accepted, we will submit another one on LTR

@@ -233,6 +233,41 @@ public void prepare(ResponseBuilder rb) throws IOException
}
}

private boolean allowSkipSecondGroupingStep(final GroupingSpecification groupingSpec, final boolean isReranking ) {
// Only possible if we only want one doc per group
if (groupingSpec.getGroupLimit() != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: getGroupLimit() is one of the deprecated GroupingSpecification accessors; something like cpoerschke@5c227c4 could be a replacement (that commit also includes parameterised logging i.e. {} instead of +)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Applied in eca5af7

* the Solr id and the score of the document */
public Object topDocSolrId;
public float topDocScore;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about additional object/GC footprint (for anyone who does not benefit from that) associated with the addition of three members here and opened https://issues.apache.org/jira/browse/LUCENE-8728 to explore further around that.

@diegoceccarelli
Copy link
Contributor

@cpoerschke Thanks!

@cpoerschke @mjosephidou I rebased the patch on top of the current master and applied @cpoerschke suggestions - precommit and unit tests succeeded.

One thing that I want to discuss and fix is the value of numFound for each group, at the moment is always 1 (I'm relying what written in the comments) - but in case of one shard or shards partitioned according to the grouping field, it might be possible to return the exact number. I would support that, what do you think? at the same time, in case of multiple shards numFound will not be reliable, but numFound can't be disabled or requested, should we just write proper documentation for this behaviour?

if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY && rb.getGroupingSpec().isSkipSecondGroupingStep()) {
shardRequestFactory = new StoredFieldsShardRequestFactory();
nextStage = ResponseBuilder.STAGE_DONE;
rb.stage = ResponseBuilder.STAGE_GET_FIELDS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Background re: existing code:

SearchHandler.handleRequestBody calls distributedProcess on all configured components c and via the

nextStage = Math.min(nextStage, c.distributedProcess(rb));

formula each component thus gets a say in what nextStage is. The overall decision is then effected via the rb.stage = nextStage; assignment.

Observation re: this proposed QueryComponent.groupedDistributedProcess change:

If QueryComponent also assigns to rb.stage then that could 'confuse' the SearchHandler logic e.g. specifically components running after the QueryComponent would not have an opportunity to do anything in the ResponseBuilder.STAGE_EXECUTE_QUERY stage.

Question/Suggestion:

Might an alternative be to turn the ResponseBuilder.STAGE_EXECUTE_QUERY stage into a 'no op' as far as the QueryComponent (in isSkipSecondGroupingStep==true circumstances) is concerned? bloomberg#229 aims to illustrate an alternative QueryComponent.groupedDistributedProcess modification, changes elsewhere might be needed too (haven't looked into that yet).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpoerschke, I agree that we should skip stages in the middle, I merged your draft and later and I'm going to rerun the tests to see if it still works as intended or not.

/* If we are skipping the second grouping step we want to translate the response of the
* first step in the response of the second step and send it to the get_fields step.
*/
processSkipSecondGroupingStep(rb, mergedTopGroups, tempSearchGroupToShards, docIdToShard, groupSort, fields, groupField);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things and one question 'jumped out' at me when looking at this specific proposed change here:
(1) The new local docIdToShard variable is allocated and accumulated on both if-and-else code paths but only used in the if-case.
(2) The processSkipSecondGroupingStep takes seven arguments which (subjectively) seems a lot.
(3) question: There's no rb.mergedSearchGroups.put and rb.searchGroupToShards in the if-case -- why might that be and would that be okay?

Taking a closer look I learnt the following:
(4) In the processSkipSecondGroupingStep method there is an rb.searchGroupToShards.put call.
(5) Two of the processSkipSecondGroupingStep arguments are contained in the rb argument, namely groupSort and fields.
(6) One of the processSkipSecondGroupingStep arguments i.e. tempSearchGroupToShards is used only for the rb.searchGroupToShards.put call. If that call could be factored out of the method then that would reduce the if-vs-else-case differences and save one method argument.
(7) The new local docIdToShard variable has conceptual similarity to the existing tempSearchGroupToShards variable i.e both map 'something' to a shard or shards:
(7a) docIdToShard is a one-to-one mapping since a single document lives on exactly one shard
(7b) tempSearchGroupToShards is a one-to-mapping mapping since a single group can contain multiple documents that (between them) live on multiple shards

Next then I did some exploratory refactoring to see if/how:
(8) unnecessary allocation and accumulation of docIdToShard might be avoided for (existing) code paths that do not or cannot use the group.skip.second.step=true optimisation
(9) the processSkipSecondGroupingStep signature might be narrowed to take fewer than seven arguments

What emerged was a deceptively simple idea:
(10) replace the Map<String, Map<SearchGroup<BytesRef>, Set<String>>> tempSearchGroupToShards; local variable with a SearchGroupsContainer searchGroupsContainer; local variable and give the SearchGroupsContainer inner class methods that do what the code currently does i.e. just code refactoring
(11) for the group.skip.second.step=true optimisation extend the SearchGroupsContainer class and override methods as required

https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-4 shares what emerged, it compiles but is totally untested:

  • cpoerschke@6f53499 shows the differences relative to the branch behind the pull request here
  • 5c14302...6f53499 shows the ('all in') differences relative to the master branch baseline code

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpoerschke for the analysis. Please find the answers:

(1) I agree, allocation can be avoided.
(2) I agree, I'll see if it is possible to reduce or wrap them in a private object
(3) rb.mergedSearchGroups
It is used only to generate the second group request

for (Map.Entry<String, Collection<SearchGroup<BytesRef>>> entry : rb.mergedSearchGroups.entrySet()) {

... That we are going to skip if we apply the skip second step:
https://github.com/bloomberg/lucene-solr/blob/SOLR-11831/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L573

(4-7b) Yes

(10, 11) I Merged your changes, and I'll check that the tests are still fine, and that precommit is still happy

Thanks

@@ -34,17 +41,37 @@
/**
* Implementation for transforming {@link SearchGroup} into a {@link NamedList} structure and visa versa.
*/
public class SearchGroupsResultTransformer implements ShardResultTransformer<List<Command>, Map<String, SearchGroupsFieldCommandResult>> {
public abstract class SearchGroupsResultTransformer implements ShardResultTransformer<List<Command>, Map<String, SearchGroupsFieldCommandResult>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An existing public class becoming abstract here 'jumped out' at me a little since it could be considered to be breaking backwards compatibility?

The cpoerschke@7e47e01 commit explores removal/replacement of the SearchGroupsResultTransformer.getInstance wrapper and a next step could be for the proposed SearchGroupsResultTransformer.SkipSecondStepSearchResultResultTransformer inner class to become a SkipSecondStepSearchResultResultTransformer.java class of its own and the proposed SearchGroupsResultTransformer.DefaultSearchResultResultTransformer inner class to not be created i.e. SearchGroupsResultTransformer here to not become abstract and its constructor to remain public. This of course assumes that SearchGroupsResultTransformer would require only simple changes to make it extensible and that then SkipSecondStepSearchResultResultTransformer could extend it.

(Note that the cpoerschke@7e47e01 commit is on https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-5 branch and that branch includes https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-4 changes for convenience only i.e. not out of necessity.)

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a lot better, I merged it with the previous commit.

One comment: instead of having SearchGroupsResultTransformer implementing the default behavior and SkipSecondStepSearchResultResultTransformer, I would compose the behavior: One class for the default behaviur, and one for the skipsStep` behavior. I'll try to sketch it in a PR to this branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... instead of having SearchGroupsResultTransformer implementing the default behavior and SkipSecondStepSearchResultResultTransformer, I would compose the behavior ...

What would happen to SearchGroupsResultTransformer itself? Just asking with backwards compatibility in mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SearchGroupsResultsTransformer shouldn't change, it would just delegate the behaviour to a different class according to the skip flag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bloomberg#231 sketches the idea, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpoerschke I applied you commit (with minor changes) in 9480482 and attempted a similar approach for transformToNative in 0eb206b. Now there is no duplication, but I think we still need to revisit [Default|SkipSecondStep]SearchResultResultTransformer... I'm not convinced about SkipSecondStep inheriting and overriding Default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @diegoceccarelli for applying the serialise-one-group commit and for attempting a similar approach for the deserialisation! As you say, now there is no duplication though returning to this after a little while it seems to me at least a little bit tricky to understand where the boundaries are between the refactoring and the changes related to the skip-second-step logic i.e. to see that the refactoring does not break anything and to see that and why the skip-second-step logic will work as intended. Against that background https://issues.apache.org/jira/browse/SOLR-13585 proposes to do a pure factoring out of two methods first (albeit with foresight of the envisaged code changes here) and then hopefully the skip-second-step logic addition would be clearer to see. Admittedly though such a split approach would make for a mildly messy rebase for the PR branch here. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpoerschke https://issues.apache.org/jira/browse/SOLR-13585 Looks good to me, I would merge it and then rebase this code on top of it..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @diegoceccarelli for the SOLR-13585 review! I'll aim to commit it on Monday then. Rebasing on top of that will also pick up other grouping related changes that happened recently-ish i.e. they will be good to have as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOLR-13585 is now committed to master and branch_8x branches; apologies for the delay, the 8.2.0 release process was already underway and I was waiting for the 8.3.0 section to be ready in the solr/CHANGES.txt file.

@@ -67,8 +67,14 @@ public void transform(Map<String, ?> result, ResponseBuilder rb, SolrDocumentSou
for (GroupDocs<BytesRef> group : topGroups.groups) {
SimpleOrderedMap<Object> groupResult = new SimpleOrderedMap<>();
if (group.groupValue != null) {
final String groupValue;
if (rb.getGroupingSpec().isSkipSecondGroupingStep()) {
groupValue = groupField.getType().indexedToReadable(group.groupValue.utf8ToString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: So on this bit here I was wondering why the indexedToReadable is needed in the isSkipSecondGroupingStep case.

A: The TopGroupsResultTransformer which would be called in the not-isSkipSecondGroupingStep case has it, just the indexedToReadable that is. And the groupFieldType.toObject(groupField.createField(...)) wrap in this method is needed to 'have' e.g. the name part of the group field i.e. not just the group value stored in the groupResult map. Something along those lines.

Anyhow, then I wondered about whether or not the GroupedEndResultTransformer "needs to know" about the GroupingSpecification.isSkipSecondGroupingStep logic. cpoerschke@ce5d9a9 commit sketches how the "knowing" might be avoided, though I'm not entirely convinced by the approach for three reasons:

  • some difficulty in explaining (as evidenced by lack of javadocs) what the new GroupedEndResultTransformer.groupValueToString method really does and why it's beneficial,
  • the realisation that GroupedEndResultTransformer already "knows" about GroupingSpecification itself i.e. there are two docList.setStart(rb.getGroupingSpec().getWithinGroupOffset()); calls, and
  • the observation that after the change all GroupingSpecification.isSkipSecondGroupingStep callers are in QueryComponent which raises a "is it a grouping parameter or a query component parameter?" question.

Either way, an interesting question arose from the exercise:

  • Making the QueryComponent change in cpoerschke@ce5d9a9 brought attention to the group.main=true and group.format=simple Grouping Parameters code paths.
    • The EndResultTransformer code for those parameters is apparently not affected, and intuitively it would seem that it need not be (since the group value is not(?) included in the output) whilst still benefitting from the isSkipSecondGroupingStep logic. Though it might be nice to investigate that further and perhaps add test coverage for it?

Copy link
Contributor

@diegoceccarelli diegoceccarelli Jun 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added regression test for group.main=true and group.format=simple
in 60d6052

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrt cpoerschke/lucene-solr@ce5d9a9 , I wouldn't do that change - the check rb.getGroupingSpec().isSkipSecondGroupingStep() is just moved into QueryComponent, but it affects only the behavior in GroupedEndResultTransformer.transform so I would keep it there..

/** activates optimization in case only one document per group.
* Setting this to true is only compatible with group.limit = 1
*/
public static final String GROUP_SKIP_DISTRIBUTED_SECOND = GROUP + ".skip.second.step";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using this GROUP_SKIP_DISTRIBUTED_SECOND constant more widely throughout the code e.g. when logging and in tests? cpoerschke@6bdf872 commit sketches how it might be done. This could help code readers locate code related to the optimisation.

Copy link
Contributor

@diegoceccarelli diegoceccarelli Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c5553ed


groupingSpec.setSkipSecondGroupingStep(params.getBool(GroupParams.GROUP_SKIP_DISTRIBUTED_SECOND, false));
boolean isReranking = (rb.getRankQuery() != null);
if (groupingSpec.isSkipSecondGroupingStep() & !allowSkipSecondGroupingStep(groupingSpec.getWithinGroupSortSpec(), groupingSpec.getGroupSortSpec(), isReranking)){
Copy link
Contributor

@cpoerschke cpoerschke Apr 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: allowSkipSecondGroupingStep currently does 'error' logging in response to illegal grouping specifications in a request. this probably should not be 'error' or 'warn' logging (since there is nothing 'wrong' with Solr itself) but 'info' logging at most or no logging at all with the details returned in the Illegal grouping specification that is thrown here. An alternative could also be to 'info' log and not throw but to just ignore the group.skip.second.step=true parameter if it's invalid.

edit: a little further below in the class there's a cmd.setSegmentTerminateEarly(false); // not supported, silently ignore any segmentTerminateEarly flag for example

Copy link
Contributor

@diegoceccarelli diegoceccarelli Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 725bfc8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The QueryComponent.allowSkipSecondGroupingStep method being passed and (almost) only using two of the GroupingSpecification's parameters made me wonder if the allow logic might be moved into a validate() or similarly named method in that class -- cpoerschke@70e9c19 has a sketch -- the logic being there would make it potentially available also to other non-QueryComponent plugins and the explicit 'optionally call validate after all the setters' aspect might also be helpful in terms of code comprehension. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, but what about validate being void and throwing a SolrException in case of problems? in this case we can return the description of the particular problem in the exception and we just call validate at the end of the setup. what do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, void and throwing would work too, good idea, as you say more details can be returned that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0563e66

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: a little further below in the class there's a cmd.setSegmentTerminateEarly(false); // not supported, silently ignore any segmentTerminateEarly flag for example

Checking for this and failing if it is true will modify the current behavior - do you think we should fail?

}

// Within group sort must be the same as group sort because if we skip second step no sorting within group will be done.
if (withinGroupSpecification.getSort() != groupSort.getSort()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question/thought: is full equality required technically speaking or would prefix equality be sufficient? e.g. if the sorting within the group was by X,Y,Z and the sorting of groups was by X,Y. that way around in terms of prefixes, i think, or maybe the other way around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, done in c5024d0

@Override
public void addSearchGroupToShards(ResponseBuilder rb, String groupField, Collection<SearchGroup<BytesRef>> mergedTopGroups) {
super.addSearchGroupToShards(rb, groupField, mergedTopGroups);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So today I was seeking to understand the logic in this SkipSecondStepSearchGroupShardResponseProcessor.addSearchGroupToShards method more.

Analysis:

  • In the existing i.e. group.skip.second.step=false code paths QueryComponent.handleGroupedResponses logic involves the SearchGroupShardResponseProcessor and TopGroupsShardResponseProcessor classes for the first and second steps respectively.
  • In the new i.e. group.skip.second.step=true code paths there is no second step and so some of what TopGroupsShardResponseProcessor would do if it was called needs to be done elsewhere.
  • On the group.skip.second.step=true code path the SkipSecondStepSearchGroupShardResponseProcessor is used instead of the SearchGroupShardResponseProcessor and so structurally there already is a possibility for the former to 'do more' than the latter.

Suggestion:

  • Factoring out of a static TopGroupsShardResponseProcessor.fillResultIds method to reduce code duplication and help ensure both code paths (continue to) use the same result ids filling logic (going forward), the cpoerschke@1570f7c commit has a sketch.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I applied your change (removing public from TopGroupsShardResponseProcessor.fillResultIds , package visibility is enough)
commit a7df0c0

final String shard = docIdToShard.get(group.topDocSolrId);
assert(shard != null);
ShardDoc sdoc = new ShardDoc(group.topDocScore,
fields,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passing of fields to the ShardDoc here seems incorrect to me:

On the group.skip.second.step=true code path that particular argument is not actually used in practice, hence this name-vs-value difference can easily go unnoticed. Something like commit cpoerschke@2b3069b would be a way to use the other ShardDoc constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added in 72f559a

@diegoceccarelli
Copy link
Contributor

diegoceccarelli commented Jun 6, 2019

@cpoerschke I think I addressed all your comments, let me know if you have more comments :) Thanks!

@@ -232,6 +233,47 @@ public void prepare(ResponseBuilder rb) throws IOException
}
}

// check if prefix is a prefix of the array
private static boolean isPrefix(SortField[] prefix, SortField[] array){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an existing ArrayUtil class, I wonder if it might welcome a (generic) version of this method. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the fact that it is used only here I would keep it private - and we might put it into ArrayUtil when another class needs it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Yes, it's private and static but it seems a relatively long amount of what seems quite generic code and when another class needs something like it then it would be tricky to discover that the method already exists here and is available for relocation?

Having looked around a little, it looks like there's a Collections.indexOfSubList method in Java and if perhaps array-to-list conversion was acceptable then something like

0 == Collections.indexOfSubList(Arrays.asList(array), Arrays.asList(prefix))

might be another alternative?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I think so, let me try!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0563e66


// Within group sort must be the same as group sort because if we skip second step no sorting within group will be done.
if (! isPrefix(withinGroupSortFields, groupSortFields)) {
System.out.println("Not prefix" + Arrays.toString(withinGroupSortFields)+ " , "+Arrays.toString(groupSortFields) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI the latest ant precommit checks for print statements in production code. Suggest to either remove this one or to replace it with log.info equivalent. If logging here then perhaps the other return false should also log?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed (95ed12a), thanks. I would just throw the exception from the caller.

}

@Override
protected Object serializeOneSearchGroup(SearchGroup<BytesRef> searchGroup, SearchGroupsFieldCommand command) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpoerschke@af1a746 has potential (untested) tweaks for this SkipSecondStepSearchResultResultTransformer.serializeOneSearchGroup method such as making uniqueField a class member to avoid per-document get-schema-then-get-unique-key-field calls and replacement of the DocumentStoredFieldVisitor-based document retrieval with the simpler equivalent of TopGroupsResultTransformer.retrieveDocument -- that would be equivalent, at least at a quick glance, unless perhaps the DocumentStoredFieldVisitor use has intended side effects such as cache warning (in which case maybe comments at point of use might be helpful for future readers of the code).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 108faf8, I think that NamedList(5) was due to a previous solution that was using 5 fields, I just used the empty constructor for simplicity.


static void fillResultIds(ResponseBuilder rb){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/SOLR-13576 opened for this fairly independent refactoring here.

@diegoceccarelli diegoceccarelli force-pushed the SOLR-11831 branch 2 times, most recently from dd17b86 to 4c7e5f1 Compare August 2, 2019 15:58
@diegoceccarelli
Copy link
Contributor

@cpoerschke I rebased on top of the current master, please note that:

  1. I had to change the returned type of serializeOneSearchGroup from Object[] to Object because SkipSecondStepSearchResultResultTransformer will return a NamedList

  2. I moved the check on QueryComponent cfd22cd (SOLR-12249) - ( when group.format=grouped then, validate group.offset) into GroupingSpecification::validate

@cpoerschke
Copy link
Contributor

I rebased on top of the current master ...

Thanks @diegoceccarelli!

... had to change the returned type of serializeOneSearchGroup from Object[] to Object because ...

Yes, that sounds right. The serializeOneSearchGroup factoring out in master was scoped on the master code at the time only i.e. it did not anticipate what would be needed subsequently e.g. with the SkipSecondStepSearchResultResultTransformer here.

@cpoerschke
Copy link
Contributor

Thanks again @diegoceccarelli for rebasing on top of the current master! With the serialise/deserialiseOneGroup methods in place the SearchGroupsResultTransformer in particular is really nice to read and understand now, even and especially after some time away from the code.

My cpoerschke@20129e7 commit has a couple of tweaks and suggestions, lumped together as one commit (sorry!) but do feel free to look and selectively pick as you like.

I'll also go and annotate on the PR here the suggested tweaks that are not related to code comprehension or style consistency. And then next (not today) I'm planning to look at the tests; the .adoc documentation changes are also on the known to-do list.

sdoc.id = group.topDocSolrId;
sdoc.shard = shard;

groups[index++] = new GroupDocs<>(group.topDocScore,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing Float.NaN for the first (score) argument looks to be possible and seems more accurate here?

Copy link
Contributor

@diegoceccarelli diegoceccarelli Sep 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about using NaN because NaN cannot be compared with any floating type value I'll have to double check that we are not sorting on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after reviewing the code I agree that is OK and better to use NaN

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I double checked the code and it looks safe, merged - thank you

0, /*Set totalHitCount to 0 as we can't computed it as is */
0, /*Set totalGroupedHitCount to 0 as we can't computed it as is*/
groups,
maxScore);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpoerschke@20129e7 proposes to initialise maxScore to Float.MIN_VALUE and to pass Float.NaN here if groups happened to be empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added a test to check that the result if ok in case of 0 results

* to be unique so this is what we need to return to the federator
**/
try {
luceneDoc = searcher.doc(searchGroup.topDocLuceneId, Collections.singleton(uniqueField.getName()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpoerschke@20129e7 proposes to introduce a uniqueFieldNameAsSet member to avoid successive calls allocating identical singleton sets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, merged

@cpoerschke
Copy link
Contributor

Thanks @diegoceccarelli for the updates and for splitting the unrelated maxScore change out into a separate PR! I've just opened https://issues.apache.org/jira/browse/SOLR-13812 for the also unrelated (though admittedly smaller and less controversional) SolrTestCaseJ4 change and with added test coverage for that too.

Malvina Josephidou and others added 23 commits October 4, 2019 14:46
…gas patch)

Summary:
In cases where we do grouping and ask for  {{group.limit=1}} only it is possible to skip the second grouping step. In our test datasets it improved speed by around 40%.

Essentially, in the first grouping step each shard returns the top K groups based on the highest scoring document in each group. The top K groups from each shard are merged in the federator and in the second step we ask all the shards to return the top documents from each of the top ranking groups.

If we only want to return the highest scoring document per group we can return the top document id in the first step, merge results in the federator to retain the top K groups and then skip the second grouping step entirely.

QueryComponent: interim 'make it compile (somehow)' change (#228)

add SearchGroupsContainer (#230)

factor out SearchGroupsResultTransformer.serializeOneSearchGroup method (Christine)

Refactor transformToNative adding deserializeOneSearchGroup

 increase GroupParams.GROUP_SKIP_DISTRIBUTED_SECOND use (see also 6bdf87)

Remove error logging in allowSkipSecondGroupingStep

Check that withinGroupSort is a prefix of groupSort

SkipSecondStepSearchGroupShardResponseProcessor.addSearchGroupToShards now leaves ShardDoc.fields null

factor out TopGroupsShardResponseProcessor.fillResultIds method

Add regression test group.main=true and group.format=simple

Improve GroupingSpecification validation adding validate() method

SkipSecondStepSearchResultResultTransformer.serializeOneSearchGroup tweaks
@diegoceccarelli
Copy link
Contributor

@cpoerschke what do you think about removing the helper function:

private void assertSimpleQueryThrows(Object... queryParams) {

and we replace the calls to assertSimpleQueryThrows with expectThrows(Exception.class, ... ) ?

@cpoerschke
Copy link
Contributor

... replace the calls to assertSimpleQueryThrows with expectThrows(Exception.class, ... ) ?

Sounds good to me.

Comment on lines +152 to +154
if (comparator instanceof FieldComparator.RelevanceComparator){
searchGroup.topDocScore = (Float)comparator.value(group.comparatorSlot);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The searchGroup.topDocScore being filled in conditionally here i.e. only if it's a field comparator 'jumps out' a little.

Have not yet looked into what other comparators there are and if removing of the conditional might be practical.

But this also re-surfaced the SearchGroup.java question around adding or not adding additional fields and what (if any) alternative might be possible.

With this in mind https://github.com/cpoerschke/lucene-solr/commits/github-bloomberg-SOLR-11831-cpoerschke-13 explores the 'extend FirstPassGroupingCollector and SearchGroup' route a little further:

  • presume LUCENE-8728 changes or equivalent are available
  • add SolrFirstPassGroupingCollector and SolrSearchGroup classes
  • on the group.skip.second.step code paths always have SolrSearchGroup instead of SearchGroup
  • not yet done:
    • creation of SolrFirstPassGroupingCollector instead of FirstPassGroupingCollector if-and-only-if appropriate
    • tests need to pass again
    • consider if this would actually be comprehensible, maintainable and helpful e.g. w.r.t. (Solr)SearchGroup fields always being filled in.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking the conversation back from the commits on the above branch to here: bloomberg#232 sketches extending Lucene's FirstPassGroupingCollector and SearchGroup for Solr's group.skip.second.step use as outlined above.

Flagged up by that sketching, and not yet fully explored, is the question of whether or not the exclusion of group.func is actually needed, superficially it seemed it might not be i.e. the federator is just merging groups and sorting by id(s) and where the ids came from might not matter.

dnhatn added a commit to dnhatn/lucene-solr that referenced this pull request Sep 15, 2021
The first documents of subsequent segments are mistakenly skipped when
sort optimization is enabled. We should initialize maxDocVisited in
NumericComparator to -1 instead of 0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants