-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Around Mapper and Mapping #1939
Refactor Around Mapper and Mapping #1939
Conversation
2dda3a8
to
90530a9
Compare
90530a9
to
2bb180c
Compare
@@ -251,62 +251,6 @@ public void testAddKNNBinaryField_fromScratch_nmslibCurrent() throws IOException | |||
assertNotEquals(0, (long) KNNGraphValue.MERGE_TOTAL_SIZE_IN_BYTES.getValue()); | |||
} | |||
|
|||
public void testAddKNNBinaryField_fromScratch_nmslibLegacy() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting because Codec doesnt need to be aware of legacy
2bb180c
to
25a0f5a
Compare
* @param modelId Model id used to create the index; null if not created from model | ||
* @param dimension Dimension used to create the index; needs to be null for model-based indices | ||
*/ | ||
public ANNConfig(ANNConfigType annConfigType, KNNMethodContext knnMethodContext, String modelId, Integer dimension) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using builder pattern could be better?
Other option is making each type of config as separate class and have its own validation there. Then create them using factory pattern. This will make it clear which validation is for which config type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the builder pattern to create ANNConfig object. Now should we use lombok builder pattern or not, I would leave that upto @jmazanec15.
Other option is making each type of config as separate class and have its own validation there.
I would really like validation on the objects present in their own classes, but one thing I am not sure here is does a specific value of lets say KNNMethodContext
impact ANNConfigType
validations. Then it can become tricky. I would like to understand more on this before we move to more complex way of creating and validating objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point, why modelId and dimension are not encapsulated in classes? when spaceType will come will that be another parameter to this constructor? @jmazanec15 would like to know your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dimension and modelId are direct top-level parameters. Want to hold off for this PR in wrapping these in anything. In the future, we can wrap if we need to.
And yes, after some thought, I am going to make ANNConfig -> KNNMappingConfig and implement it per FieldMapper. This should simplify things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to make ANNConfig an interface "KNNMappingConfig" and removed ANNConfigType. Now, the individual field mappers will construct KNNMappingConfig as they need to. I think this resolves these comments.
* | ||
* @return Optional containing the modelId if created from model, otherwise empty | ||
*/ | ||
public Optional<String> getModelId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not having modelId as Optional object instead of String?
Is this to prevent the value from being updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason around having Optional was to force calling code to handle null case. Otherwise, the calling code has to do their own null checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, why don't we have it like
private Optional<String> modelId;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is message from intellij warning:
'Optional<String>' used as type for field 'modelId'
Inspection info: Reports any cases in which java.util.Optional<T>, java.util.OptionalDouble, java.util.OptionalInt, java.util.OptionalLong, or com.google.common.base.Optional are used as types for fields or parameters.
Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed.
Using a field with the java.util.Optional type is also problematic if the class needs to be Serializable, as java.util.Optional is not serializable.
Optional was designed to provide a limited mechanism for library method return types in which a clear way to represent "no result" was needed.
This is main purpose for this I felt
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperUtil.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java
Outdated
Show resolved
Hide resolved
maybeInitLazyVariables(modelMetadata); | ||
} | ||
|
||
private void maybeInitLazyVariables(ModelMetadata modelMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just have a separate class and handle validation and processing there dynamically based on values in modelMetadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason of this style of atomic thing? Does this method called multiple time? Also, are these variables used multiple times and that is why we need to keep all these variable in instance level?
Can it be
private static final HashSet<VectorDataType, PerDimensionValidator> VALIDATOR = ...
validator.get(vectorDataType).validate();
Or
private static final PerDimensionValidator perDimensionValidator;
perDimensionValidator.validate(vectorDataType);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the model field mapper is that it relies on cluster state to store the model id. But during recovery, the cluster state is not available. So we cannot get the model metadata during field mapper creation. Related PR: #111. So, the idea is that we have to set these onces field parsing begins. The reason for atomics is to ensure they get set once safely.
25a0f5a
to
048f6b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the refactoring @jmazanec15 . When I was reviewing a new update got added. So not sure if all my comments are still valid. But if things are fixed please let me know. I will take another pass at the code.
High level comment:
- There are some complex conditions added in the KNNVectorFieldMapper I would love to see them getting simplified in an easy to read/understand fashion.
src/main/java/org/opensearch/knn/index/mapper/ANNConfigType.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/ANNConfigType.java
Outdated
Show resolved
Hide resolved
* @param modelId Model id used to create the index; null if not created from model | ||
* @param dimension Dimension used to create the index; needs to be null for model-based indices | ||
*/ | ||
public ANNConfig(ANNConfigType annConfigType, KNNMethodContext knnMethodContext, String modelId, Integer dimension) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the builder pattern to create ANNConfig object. Now should we use lombok builder pattern or not, I would leave that upto @jmazanec15.
Other option is making each type of config as separate class and have its own validation there.
I would really like validation on the objects present in their own classes, but one thing I am not sure here is does a specific value of lets say KNNMethodContext
impact ANNConfigType
validations. Then it can become tricky. I would like to understand more on this before we move to more complex way of creating and validating objects.
* @param modelId Model id used to create the index; null if not created from model | ||
* @param dimension Dimension used to create the index; needs to be null for model-based indices | ||
*/ | ||
public ANNConfig(ANNConfigType annConfigType, KNNMethodContext knnMethodContext, String modelId, Integer dimension) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point, why modelId and dimension are not encapsulated in classes? when spaceType will come will that be another parameter to this constructor? @jmazanec15 would like to know your thoughts on this.
// because to get the cluster metadata, you need access to the cluster state. Because this code is | ||
// sometimes used to initialize the cluster state/update cluster state, we cannot get the state here | ||
// safely. So, we are unable to validate the model. The model gets validated during ingestion. | ||
boolean isResolvedNull = resolvedKNNMethodContext == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some comments here what does this resolvedKNNMethodContext
mean.
// case, the input resolvedKNNMethodContext will be null and the settings wont exist (so flat mapper should | ||
// be used). Otherwise, we need to check the setting. | ||
boolean isSettingPresent = KNNSettings.IS_KNN_INDEX_SETTING.exists(context.indexSettings()); | ||
if (isResolvedNull && (!isSettingPresent || !KNNSettings.IS_KNN_INDEX_SETTING.get(context.indexSettings()))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we simplify this check. Its really hard to follow and will be hard to make a sense after few months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try. It is a little tricky to simplify because of upstream issue where merge does not have access to settings (see https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/ParametrizedFieldMapper.java#L130 and opendistro-for-elasticsearch/k-NN#288)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a shot at it. The condition may look easy to read right now, but in next few months it will be hard. Or give an extensive java docs on this so that it is easy to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think it can be further simplified. I added a fairly large comment here: https://github.com/jmazanec15/k-NN-1/blob/field-mapper-ref/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L174-L188. Note that a similar problem exists with the legacy field mapper today (see https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L181-L196)
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/VectorValidator.java
Outdated
Show resolved
Hide resolved
* @param vectorDataType data type of the vector | ||
* @param annConfig configuration context for the ANN index | ||
*/ | ||
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public KNNVectorFieldType(String name, Map<String, String> meta, VectorDataType vectorDataType, ANNConfig annConfig) { | |
public KNNVectorFieldType(String name, Map<String, String> metadata, VectorDataType vectorDataType, ANNConfig annConfig) { |
public final class ANNConfig { | ||
|
||
@Getter | ||
private final ANNConfigType annConfigType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a type here, is it worth simplifying it to have a boolean requiresTraining()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a third case where flatmapper when knn=false where it is not model based or method based. Working on cleaning this up a little bit.
private final KNNMethodContext knnMethodContext; | ||
private final String modelId; | ||
private final Integer dimension; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
KNNMethodContext
here is dense with layers in it. For instance model based context has KNNEngine inside of model metadata and in a layer above otherwise, this causes if else checks to getKNNEngine
. Unwrapping KNNMethodContext and storing specific values will isolate different cases and logic associated with it while setting it rather than during get. -
Is it worth considering to have specific configuration with hard types derived from ANNConfig. There is a case here where if modelId is present then dimension is -1. We can model it in a way where both fields are not in the same implementation. I am not sure how much this will help
Note: Please look around for backward compatibility with the approaches suggested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KNNMethodContext here is dense with layers in it. For instance model based context has KNNEngine inside of model metadata and in a layer above otherwise, this causes if else checks to get KNNEngine. Unwrapping KNNMethodContext and storing specific values will isolate different cases and logic associated with it while setting it rather than during get.
I prefer right now to keep it simple and encapsulate the 3 values from the mapping that can have null behavior or null-like behavior. I think this is something that can be circled back to in future.
Is it worth considering to have specific configuration with hard types derived from ANNConfig. There is a case here where if modelId is present then dimension is -1. We can model it in a way where both fields are not in the same implementation. I am not sure how much this will help
Yes, I think I agree with this. Im going to switch ANNConfig to KNNMappingConfig and make it an interface.
9de1292
to
b46d203
Compare
Refactors FieldMapper logic. It removes the LegacyFieldMapper and replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to create fields that do not build ANN indices. Additionally, it puts dimension, model_id, and knn_method_context in a new ANNConfig class and adds some safety checks around accessing them. This should make calling logic easier to handle. Lastly, it cleans up the parsing so that there isnt encoder parsing directly in the KNNVectorFieldMapper. Signed-off-by: John Mazanec <[email protected]>
b46d203
to
0e1cd5b
Compare
} | ||
int expectedDimensions = knnVectorFieldType.getKnnMappingConfig() | ||
.getDimension() | ||
.orElseGet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make knnMappingConfg read dimension from model?
This code is mixing up something. For example, getDimension is meaningless for model config. Then, getDimensionFromModelId is also meaningless for non model config but this code does not tell that and engineer will not be able to catch that internal logic from this code itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved get dimension to be non-optional
Signed-off-by: John Mazanec <[email protected]>
5e61e3f
to
a85ea01
Compare
Signed-off-by: John Mazanec <[email protected]>
a85ea01
to
e22fcf8
Compare
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Refactors FieldMapper logic. It removes the LegacyFieldMapper and replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to create fields that do not build ANN indices. Additionally, it puts dimension, model_id, and knn_method_context in a new KNNMappingConfig class and adds some safety checks around accessing them. This should make calling logic easier to handle. Lastly, it cleans up the parsing so that there isnt encoder parsing directly in the KNNVectorFieldMapper. Signed-off-by: John Mazanec <[email protected]> (cherry picked from commit 2cd57e8)
Refactors FieldMapper logic. It removes the LegacyFieldMapper and replaces it with a FlatFieldMapper. The FlatFieldMapper's role is to create fields that do not build ANN indices. Additionally, it puts dimension, model_id, and knn_method_context in a new KNNMappingConfig class and adds some safety checks around accessing them. This should make calling logic easier to handle. Lastly, it cleans up the parsing so that there isnt encoder parsing directly in the KNNVectorFieldMapper. Signed-off-by: John Mazanec <[email protected]> Signed-off-by: Akash Shankaran <[email protected]>
Description
Refactoring of the current FieldMapper structure with the main motivation that simplification will make easier to make changes with confidence. A good amount of complexity has built up given that there are multiple ways to create an index (LegacyFieldMapper, Model, Lucene and Method) with different data types. I wanted to re-organize so that we can extend easier.
In general, this refactor does the following:
One note - I did leave out changing too much around the VectorDataType. I think this should eventually go in the ANNConfig class, but will leave for another time.
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.