-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-11831: Skip second grouping step if group.limit is 1 (aka Las Vegas Patch) #300
base: master
Are you sure you want to change the base?
Conversation
boolean byRelevanceOnly = false; | ||
SortField[] sortFields = groupingSpec.getGroupSort().getSort(); | ||
|
||
if(sortFields != null && sortFields.length == 1 && sortFields[0] != null && sortFields[0].getComparator() instanceof FieldComparator.RelevanceComparator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SortField => getComparator(final int numHits, final int sortPos) .. differs in signature of the api..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the PR went out of sync, we will update it asap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the PR went out of sync ...
If you'd like bloomberg#228 pull request to merge into the https://github.com/bloomberg/lucene-solr/tree/SOLR-11831 branch (the branch behind this pull request here) has an interim 'make it compile (somehow)' change (also includes 'logger' to 'LOG' renames in this method).
} | ||
|
||
// TODO: At the moment the optimization does not support reranking | ||
if(isReranking) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will ltr not benefit from this optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! if this PR gets accepted, we will submit another one on LTR
@@ -233,6 +233,41 @@ public void prepare(ResponseBuilder rb) throws IOException | |||
} | |||
} | |||
|
|||
private boolean allowSkipSecondGroupingStep(final GroupingSpecification groupingSpec, final boolean isReranking ) { | |||
// Only possible if we only want one doc per group | |||
if (groupingSpec.getGroupLimit() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: getGroupLimit()
is one of the deprecated GroupingSpecification
accessors; something like cpoerschke@5c227c4 could be a replacement (that commit also includes parameterised logging i.e. {}
instead of +
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Applied in eca5af7
* the Solr id and the score of the document */ | ||
public Object topDocSolrId; | ||
public float topDocScore; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering about additional object/GC footprint (for anyone who does not benefit from that) associated with the addition of three members here and opened https://issues.apache.org/jira/browse/LUCENE-8728 to explore further around that.
9ff670e
to
39e9ad2
Compare
@cpoerschke Thanks! @cpoerschke @mjosephidou I rebased the patch on top of the current master and applied @cpoerschke suggestions - precommit and unit tests succeeded. One thing that I want to discuss and fix is the value of |
if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY && rb.getGroupingSpec().isSkipSecondGroupingStep()) { | ||
shardRequestFactory = new StoredFieldsShardRequestFactory(); | ||
nextStage = ResponseBuilder.STAGE_DONE; | ||
rb.stage = ResponseBuilder.STAGE_GET_FIELDS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Background re: existing code:
SearchHandler.handleRequestBody calls distributedProcess
on all configured components c
and via the
nextStage = Math.min(nextStage, c.distributedProcess(rb));
formula each component thus gets a say in what nextStage
is. The overall decision is then effected via the rb.stage = nextStage; assignment.
Observation re: this proposed QueryComponent.groupedDistributedProcess
change:
If QueryComponent
also assigns to rb.stage
then that could 'confuse' the SearchHandler
logic e.g. specifically components running after the QueryComponent
would not have an opportunity to do anything in the ResponseBuilder.STAGE_EXECUTE_QUERY
stage.
Question/Suggestion:
Might an alternative be to turn the ResponseBuilder.STAGE_EXECUTE_QUERY
stage into a 'no op' as far as the QueryComponent
(in isSkipSecondGroupingStep==true
circumstances) is concerned? bloomberg#229 aims to illustrate an alternative QueryComponent.groupedDistributedProcess
modification, changes elsewhere might be needed too (haven't looked into that yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpoerschke, I agree that we should skip stages in the middle, I merged your draft and later and I'm going to rerun the tests to see if it still works as intended or not.
/* If we are skipping the second grouping step we want to translate the response of the | ||
* first step in the response of the second step and send it to the get_fields step. | ||
*/ | ||
processSkipSecondGroupingStep(rb, mergedTopGroups, tempSearchGroupToShards, docIdToShard, groupSort, fields, groupField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things and one question 'jumped out' at me when looking at this specific proposed change here:
(1) The new local docIdToShard
variable is allocated and accumulated on both if-and-else code paths but only used in the if-case.
(2) The processSkipSecondGroupingStep
takes seven arguments which (subjectively) seems a lot.
(3) question: There's no rb.mergedSearchGroups.put
and rb.searchGroupToShards
in the if-case -- why might that be and would that be okay?
Taking a closer look I learnt the following:
(4) In the processSkipSecondGroupingStep
method there is an rb.searchGroupToShards.put
call.
(5) Two of the processSkipSecondGroupingStep
arguments are contained in the rb
argument, namely groupSort
and fields
.
(6) One of the processSkipSecondGroupingStep
arguments i.e. tempSearchGroupToShards
is used only for the rb.searchGroupToShards.put
call. If that call could be factored out of the method then that would reduce the if-vs-else-case differences and save one method argument.
(7) The new local docIdToShard
variable has conceptual similarity to the existing tempSearchGroupToShards
variable i.e both map 'something' to a shard or shards:
(7a) docIdToShard
is a one-to-one mapping since a single document lives on exactly one shard
(7b) tempSearchGroupToShards
is a one-to-mapping mapping since a single group can contain multiple documents that (between them) live on multiple shards
Next then I did some exploratory refactoring to see if/how:
(8) unnecessary allocation and accumulation of docIdToShard
might be avoided for (existing) code paths that do not or cannot use the group.skip.second.step=true
optimisation
(9) the processSkipSecondGroupingStep
signature might be narrowed to take fewer than seven arguments
What emerged was a deceptively simple idea:
(10) replace the Map<String, Map<SearchGroup<BytesRef>, Set<String>>> tempSearchGroupToShards;
local variable with a SearchGroupsContainer searchGroupsContainer;
local variable and give the SearchGroupsContainer
inner class methods that do what the code currently does i.e. just code refactoring
(11) for the group.skip.second.step=true
optimisation extend the SearchGroupsContainer
class and override methods as required
https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-4 shares what emerged, it compiles but is totally untested:
- cpoerschke@6f53499 shows the differences relative to the branch behind the pull request here
- 5c14302...6f53499 shows the ('all in') differences relative to the master branch baseline code
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpoerschke for the analysis. Please find the answers:
(1) I agree, allocation can be avoided.
(2) I agree, I'll see if it is possible to reduce or wrap them in a private object
(3) rb.mergedSearchGroups
It is used only to generate the second group request
Line 114 in 1d85cd7
for (Map.Entry<String, Collection<SearchGroup<BytesRef>>> entry : rb.mergedSearchGroups.entrySet()) { |
... That we are going to skip if we apply the skip second step:
https://github.com/bloomberg/lucene-solr/blob/SOLR-11831/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L573
(4-7b) Yes
(10, 11) I Merged your changes, and I'll check that the tests are still fine, and that precommit is still happy
Thanks
@@ -34,17 +41,37 @@ | |||
/** | |||
* Implementation for transforming {@link SearchGroup} into a {@link NamedList} structure and visa versa. | |||
*/ | |||
public class SearchGroupsResultTransformer implements ShardResultTransformer<List<Command>, Map<String, SearchGroupsFieldCommandResult>> { | |||
public abstract class SearchGroupsResultTransformer implements ShardResultTransformer<List<Command>, Map<String, SearchGroupsFieldCommandResult>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An existing public class becoming abstract
here 'jumped out' at me a little since it could be considered to be breaking backwards compatibility?
The cpoerschke@7e47e01 commit explores removal/replacement of the SearchGroupsResultTransformer.getInstance
wrapper and a next step could be for the proposed SearchGroupsResultTransformer.SkipSecondStepSearchResultResultTransformer
inner class to become a SkipSecondStepSearchResultResultTransformer.java
class of its own and the proposed SearchGroupsResultTransformer.DefaultSearchResultResultTransformer
inner class to not be created i.e. SearchGroupsResultTransformer
here to not become abstract and its constructor to remain public. This of course assumes that SearchGroupsResultTransformer
would require only simple changes to make it extensible and that then SkipSecondStepSearchResultResultTransformer
could extend it.
(Note that the cpoerschke@7e47e01 commit is on https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-5 branch and that branch includes https://github.com/cpoerschke/lucene-solr/tree/github-bloomberg-SOLR-11831-cpoerschke-4 changes for convenience only i.e. not out of necessity.)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a lot better, I merged it with the previous commit.
One comment: instead of having SearchGroupsResultTransformer
implementing the default behavior and SkipSecondStepSearchResultResultTransformer, I would compose the behavior: One class for the default
behaviur, and one for the
skipsStep` behavior. I'll try to sketch it in a PR to this branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... instead of having
SearchGroupsResultTransformer
implementing the default behavior and SkipSecondStepSearchResultResultTransformer, I would compose the behavior ...
What would happen to SearchGroupsResultTransformer
itself? Just asking with backwards compatibility in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SearchGroupsResultsTransformer
shouldn't change, it would just delegate the behaviour to a different class according to the skip
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bloomberg#231 sketches the idea, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpoerschke I applied you commit (with minor changes) in 9480482 and attempted a similar approach for transformToNative
in 0eb206b. Now there is no duplication, but I think we still need to revisit [Default|SkipSecondStep]SearchResultResultTransformer
... I'm not convinced about SkipSecondStep inheriting and overriding Default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @diegoceccarelli for applying the serialise-one-group commit and for attempting a similar approach for the deserialisation! As you say, now there is no duplication though returning to this after a little while it seems to me at least a little bit tricky to understand where the boundaries are between the refactoring and the changes related to the skip-second-step logic i.e. to see that the refactoring does not break anything and to see that and why the skip-second-step logic will work as intended. Against that background https://issues.apache.org/jira/browse/SOLR-13585 proposes to do a pure factoring out of two methods first (albeit with foresight of the envisaged code changes here) and then hopefully the skip-second-step logic addition would be clearer to see. Admittedly though such a split approach would make for a mildly messy rebase for the PR branch here. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpoerschke https://issues.apache.org/jira/browse/SOLR-13585 Looks good to me, I would merge it and then rebase this code on top of it..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @diegoceccarelli for the SOLR-13585 review! I'll aim to commit it on Monday then. Rebasing on top of that will also pick up other grouping related changes that happened recently-ish i.e. they will be good to have as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SOLR-13585 is now committed to master and branch_8x branches; apologies for the delay, the 8.2.0 release process was already underway and I was waiting for the 8.3.0 section to be ready in the solr/CHANGES.txt file.
b78c99e
to
f4194ea
Compare
@@ -67,8 +67,14 @@ public void transform(Map<String, ?> result, ResponseBuilder rb, SolrDocumentSou | |||
for (GroupDocs<BytesRef> group : topGroups.groups) { | |||
SimpleOrderedMap<Object> groupResult = new SimpleOrderedMap<>(); | |||
if (group.groupValue != null) { | |||
final String groupValue; | |||
if (rb.getGroupingSpec().isSkipSecondGroupingStep()) { | |||
groupValue = groupField.getType().indexedToReadable(group.groupValue.utf8ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: So on this bit here I was wondering why the indexedToReadable
is needed in the isSkipSecondGroupingStep
case.
A: The TopGroupsResultTransformer which would be called in the not-isSkipSecondGroupingStep
case has it, just the indexedToReadable
that is. And the groupFieldType.toObject(groupField.createField(...))
wrap in this method is needed to 'have' e.g. the name part of the group field i.e. not just the group value stored in the groupResult
map. Something along those lines.
Anyhow, then I wondered about whether or not the GroupedEndResultTransformer
"needs to know" about the GroupingSpecification.isSkipSecondGroupingStep
logic. cpoerschke@ce5d9a9 commit sketches how the "knowing" might be avoided, though I'm not entirely convinced by the approach for three reasons:
- some difficulty in explaining (as evidenced by lack of javadocs) what the new
GroupedEndResultTransformer.groupValueToString
method really does and why it's beneficial, - the realisation that
GroupedEndResultTransformer
already "knows" aboutGroupingSpecification
itself i.e. there are twodocList.setStart(rb.getGroupingSpec().getWithinGroupOffset());
calls, and - the observation that after the change all
GroupingSpecification.isSkipSecondGroupingStep
callers are inQueryComponent
which raises a "is it a grouping parameter or a query component parameter?" question.
Either way, an interesting question arose from the exercise:
- Making the
QueryComponent
change in cpoerschke@ce5d9a9 brought attention to thegroup.main=true
andgroup.format=simple
Grouping Parameters code paths.- The
EndResultTransformer
code for those parameters is apparently not affected, and intuitively it would seem that it need not be (since the group value is not(?) included in the output) whilst still benefitting from theisSkipSecondGroupingStep
logic. Though it might be nice to investigate that further and perhaps add test coverage for it?
- The
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added regression test for group.main=true
and group.format=simple
in 60d6052
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt cpoerschke/lucene-solr@ce5d9a9 , I wouldn't do that change - the check rb.getGroupingSpec().isSkipSecondGroupingStep()
is just moved into QueryComponent
, but it affects only the behavior in GroupedEndResultTransformer.transform
so I would keep it there..
/** activates optimization in case only one document per group. | ||
* Setting this to true is only compatible with group.limit = 1 | ||
*/ | ||
public static final String GROUP_SKIP_DISTRIBUTED_SECOND = GROUP + ".skip.second.step"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using this GROUP_SKIP_DISTRIBUTED_SECOND
constant more widely throughout the code e.g. when logging and in tests? cpoerschke@6bdf872 commit sketches how it might be done. This could help code readers locate code related to the optimisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c5553ed
|
||
groupingSpec.setSkipSecondGroupingStep(params.getBool(GroupParams.GROUP_SKIP_DISTRIBUTED_SECOND, false)); | ||
boolean isReranking = (rb.getRankQuery() != null); | ||
if (groupingSpec.isSkipSecondGroupingStep() & !allowSkipSecondGroupingStep(groupingSpec.getWithinGroupSortSpec(), groupingSpec.getGroupSortSpec(), isReranking)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: allowSkipSecondGroupingStep
currently does 'error' logging in response to illegal grouping specifications in a request. this probably should not be 'error' or 'warn' logging (since there is nothing 'wrong' with Solr itself) but 'info' logging at most or no logging at all with the details returned in the Illegal grouping specification
that is thrown here. An alternative could also be to 'info' log and not throw but to just ignore the group.skip.second.step=true
parameter if it's invalid.
edit: a little further below in the class there's a cmd.setSegmentTerminateEarly(false); // not supported, silently ignore any segmentTerminateEarly flag
for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 725bfc8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QueryComponent.allowSkipSecondGroupingStep
method being passed and (almost) only using two of the GroupingSpecification
's parameters made me wonder if the allow logic might be moved into a validate()
or similarly named method in that class -- cpoerschke@70e9c19 has a sketch -- the logic being there would make it potentially available also to other non-QueryComponent
plugins and the explicit 'optionally call validate after all the setters' aspect might also be helpful in terms of code comprehension. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, but what about validate
being void
and throwing a SolrException
in case of problems? in this case we can return the description of the particular problem in the exception and we just call validate
at the end of the setup. what do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, void
and throwing would work too, good idea, as you say more details can be returned that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 0563e66
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: a little further below in the class there's a cmd.setSegmentTerminateEarly(false); // not supported, silently ignore any segmentTerminateEarly flag for example
Checking for this and failing if it is true will modify the current behavior - do you think we should fail?
} | ||
|
||
// Within group sort must be the same as group sort because if we skip second step no sorting within group will be done. | ||
if (withinGroupSpecification.getSort() != groupSort.getSort()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question/thought: is full equality required technically speaking or would prefix equality be sufficient? e.g. if the sorting within the group was by X,Y,Z
and the sorting of groups was by X,Y
. that way around in terms of prefixes, i think, or maybe the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, done in c5024d0
@Override | ||
public void addSearchGroupToShards(ResponseBuilder rb, String groupField, Collection<SearchGroup<BytesRef>> mergedTopGroups) { | ||
super.addSearchGroupToShards(rb, groupField, mergedTopGroups); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So today I was seeking to understand the logic in this SkipSecondStepSearchGroupShardResponseProcessor.addSearchGroupToShards
method more.
Analysis:
- In the existing i.e.
group.skip.second.step=false
code paths QueryComponent.handleGroupedResponses logic involves theSearchGroupShardResponseProcessor
andTopGroupsShardResponseProcessor
classes for the first and second steps respectively. - In the new i.e.
group.skip.second.step=true
code paths there is no second step and so some of whatTopGroupsShardResponseProcessor
would do if it was called needs to be done elsewhere. - On the
group.skip.second.step=true
code path theSkipSecondStepSearchGroupShardResponseProcessor
is used instead of theSearchGroupShardResponseProcessor
and so structurally there already is a possibility for the former to 'do more' than the latter.
Suggestion:
- Factoring out of a static
TopGroupsShardResponseProcessor.fillResultIds
method to reduce code duplication and help ensure both code paths (continue to) use the same result ids filling logic (going forward), the cpoerschke@1570f7c commit has a sketch.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I applied your change (removing public
from TopGroupsShardResponseProcessor.fillResultIds
, package visibility is enough)
commit a7df0c0
final String shard = docIdToShard.get(group.topDocSolrId); | ||
assert(shard != null); | ||
ShardDoc sdoc = new ShardDoc(group.topDocScore, | ||
fields, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The passing of fields
to the ShardDoc
here seems incorrect to me:
ShardDoc
is aFieldDoc
and per https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.0.0/lucene/core/src/java/org/apache/lucene/search/FieldDoc.java#L43-L51 it expects "... values which are used to sort ..." to be stored in that field (e.g. 2019).- Instead here
GroupingSpecification.getFields()
is passed (e.g.publicationYear
) i.e. a field name rather than a field value.
On the group.skip.second.step=true
code path that particular argument is not actually used in practice, hence this name-vs-value difference can easily go unnoticed. Something like commit cpoerschke@2b3069b would be a way to use the other ShardDoc
constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, added in 72f559a
d717987
to
173c013
Compare
@cpoerschke I think I addressed all your comments, let me know if you have more comments :) Thanks! |
@@ -232,6 +233,47 @@ public void prepare(ResponseBuilder rb) throws IOException | |||
} | |||
} | |||
|
|||
// check if prefix is a prefix of the array | |||
private static boolean isPrefix(SortField[] prefix, SortField[] array){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an existing ArrayUtil class, I wonder if it might welcome a (generic) version of this method. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the fact that it is used only here I would keep it private - and we might put it into ArrayUtil
when another class needs it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. Yes, it's private and static but it seems a relatively long amount of what seems quite generic code and when another class needs something like it then it would be tricky to discover that the method already exists here and is available for relocation?
Having looked around a little, it looks like there's a Collections.indexOfSubList method in Java and if perhaps array-to-list conversion was acceptable then something like
0 == Collections.indexOfSubList(Arrays.asList(array), Arrays.asList(prefix))
might be another alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I think so, let me try!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 0563e66
|
||
// Within group sort must be the same as group sort because if we skip second step no sorting within group will be done. | ||
if (! isPrefix(withinGroupSortFields, groupSortFields)) { | ||
System.out.println("Not prefix" + Arrays.toString(withinGroupSortFields)+ " , "+Arrays.toString(groupSortFields) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI the latest ant precommit
checks for print statements in production code. Suggest to either remove this one or to replace it with log.info
equivalent. If logging here then perhaps the other return false
should also log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed (95ed12a), thanks. I would just throw the exception from the caller.
} | ||
|
||
@Override | ||
protected Object serializeOneSearchGroup(SearchGroup<BytesRef> searchGroup, SearchGroupsFieldCommand command) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpoerschke@af1a746 has potential (untested) tweaks for this SkipSecondStepSearchResultResultTransformer.serializeOneSearchGroup
method such as making uniqueField
a class member to avoid per-document get-schema-then-get-unique-key-field calls and replacement of the DocumentStoredFieldVisitor
-based document retrieval with the simpler equivalent of TopGroupsResultTransformer.retrieveDocument -- that would be equivalent, at least at a quick glance, unless perhaps the DocumentStoredFieldVisitor
use has intended side effects such as cache warning (in which case maybe comments at point of use might be helpful for future readers of the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks 108faf8, I think that NamedList(5)
was due to a previous solution that was using 5 fields, I just used the empty constructor for simplicity.
|
||
static void fillResultIds(ResponseBuilder rb){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://issues.apache.org/jira/browse/SOLR-13576 opened for this fairly independent refactoring here.
dd17b86
to
4c7e5f1
Compare
@cpoerschke I rebased on top of the current master, please note that:
|
Thanks @diegoceccarelli!
Yes, that sounds right. The |
Thanks again @diegoceccarelli for rebasing on top of the current master! With the serialise/deserialiseOneGroup methods in place the My cpoerschke@20129e7 commit has a couple of tweaks and suggestions, lumped together as one commit (sorry!) but do feel free to look and selectively pick as you like. I'll also go and annotate on the PR here the suggested tweaks that are not related to code comprehension or style consistency. And then next (not today) I'm planning to look at the tests; the |
sdoc.id = group.topDocSolrId; | ||
sdoc.shard = shard; | ||
|
||
groups[index++] = new GroupDocs<>(group.topDocScore, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing Float.NaN
for the first (score
) argument looks to be possible and seems more accurate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about using NaN
because NaN cannot be compared with any floating type value I'll have to double check that we are not sorting on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after reviewing the code I agree that is OK and better to use NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I double checked the code and it looks safe, merged - thank you
0, /*Set totalHitCount to 0 as we can't computed it as is */ | ||
0, /*Set totalGroupedHitCount to 0 as we can't computed it as is*/ | ||
groups, | ||
maxScore); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpoerschke@20129e7 proposes to initialise maxScore
to Float.MIN_VALUE
and to pass Float.NaN
here if groups
happened to be empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I added a test to check that the result if ok in case of 0 results
* to be unique so this is what we need to return to the federator | ||
**/ | ||
try { | ||
luceneDoc = searcher.doc(searchGroup.topDocLuceneId, Collections.singleton(uniqueField.getName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpoerschke@20129e7 proposes to introduce a uniqueFieldNameAsSet
member to avoid successive calls allocating identical singleton sets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, merged
Thanks @diegoceccarelli for the updates and for splitting the unrelated |
…gas patch) Summary: In cases where we do grouping and ask for {{group.limit=1}} only it is possible to skip the second grouping step. In our test datasets it improved speed by around 40%. Essentially, in the first grouping step each shard returns the top K groups based on the highest scoring document in each group. The top K groups from each shard are merged in the federator and in the second step we ask all the shards to return the top documents from each of the top ranking groups. If we only want to return the highest scoring document per group we can return the top document id in the first step, merge results in the federator to retain the top K groups and then skip the second grouping step entirely. QueryComponent: interim 'make it compile (somehow)' change (#228) add SearchGroupsContainer (#230) factor out SearchGroupsResultTransformer.serializeOneSearchGroup method (Christine) Refactor transformToNative adding deserializeOneSearchGroup increase GroupParams.GROUP_SKIP_DISTRIBUTED_SECOND use (see also 6bdf87) Remove error logging in allowSkipSecondGroupingStep Check that withinGroupSort is a prefix of groupSort SkipSecondStepSearchGroupShardResponseProcessor.addSearchGroupToShards now leaves ShardDoc.fields null factor out TopGroupsShardResponseProcessor.fillResultIds method Add regression test group.main=true and group.format=simple Improve GroupingSpecification validation adding validate() method SkipSecondStepSearchResultResultTransformer.serializeOneSearchGroup tweaks
…roupSkipSecondStep edits
8cbbb03
to
27c4355
Compare
@cpoerschke what do you think about removing the helper function:
and we replace the calls to |
Sounds good to me. |
if (comparator instanceof FieldComparator.RelevanceComparator){ | ||
searchGroup.topDocScore = (Float)comparator.value(group.comparatorSlot); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The searchGroup.topDocScore
being filled in conditionally here i.e. only if it's a field comparator 'jumps out' a little.
Have not yet looked into what other comparators there are and if removing of the conditional might be practical.
But this also re-surfaced the SearchGroup.java
question around adding or not adding additional fields and what (if any) alternative might be possible.
With this in mind https://github.com/cpoerschke/lucene-solr/commits/github-bloomberg-SOLR-11831-cpoerschke-13 explores the 'extend FirstPassGroupingCollector and SearchGroup' route a little further:
- presume LUCENE-8728 changes or equivalent are available
- add SolrFirstPassGroupingCollector and SolrSearchGroup classes
- on the
group.skip.second.step
code paths always have SolrSearchGroup instead of SearchGroup - not yet done:
- creation of SolrFirstPassGroupingCollector instead of FirstPassGroupingCollector if-and-only-if appropriate
- tests need to pass again
- consider if this would actually be comprehensible, maintainable and helpful e.g. w.r.t. (Solr)SearchGroup fields always being filled in.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking the conversation back from the commits on the above branch to here: bloomberg#232 sketches extending Lucene's FirstPassGroupingCollector and SearchGroup for Solr's group.skip.second.step use as outlined above.
Flagged up by that sketching, and not yet fully explored, is the question of whether or not the exclusion of group.func is actually needed, superficially it seemed it might not be i.e. the federator is just merging groups and sorting by id(s) and where the ids came from might not matter.
The first documents of subsequent segments are mistakenly skipped when sort optimization is enabled. We should initialize maxDocVisited in NumericComparator to -1 instead of 0.
Summary:
In cases where we do grouping and ask for {{group.limit=1}} only it is possible to skip the second grouping step. In our test datasets it improved speed by around 40%.
Essentially, in the first grouping step each shard returns the top K groups based on the highest scoring document in each group. The top K groups from each shard are merged in the federator and in the second step we ask all the shards to return the top documents from each of the top ranking groups.
If we only want to return the highest scoring document per group we can return the top document id in the first step, merge results in the federator to retain the top K groups and then skip the second grouping step entirely.