Added DynamicEmbedding RFC #446
DynamicEmbedding(
    input_dim=5,
    output_dim=2,
    input_length=5,
Can we support the inputs with dynamic shapes?
The input to the layer can be dynamic, but if you are asking about `input_dim`, which is the same as the vocabulary size, that is not dynamic.
Excellent, thank you for your answer! I would like to know what `input_dim` means. From my understanding, `input_dim` should be less than or equal to the `vocabulary size`, which is fixed while training is going on. Is that right?
input_dim should be vocabulary size
Thank you for the clarification! If `input_dim` and `vocabulary size` are not dynamic, some critical scenarios may not be supported. Some industrial dynamic embedding scenarios require algorithm engineers to use `uint64_t` for the encoded features, which have a possible range of `[0, std::numeric_limits<uint64_t>::max]`. That means `input_dim` and `vocabulary size` should not be set, because the range is almost unlimited.
@rhdong, I would like to clarify that for the layer initialization, `inp_dim` is the vocabulary size; I tried to keep it consistent with the [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding#:~:text=.Embedding(-,input_dim%2C,-output_dim%2C%0A%C2%A0%20%C2%A0%20embeddings_initializer). The input to the layer can be of any dynamic shape.
Hi @divyashreepathihalli, thank you for your clarification. I understand now. The statement "The input to the layer can be of any dynamic shape" totally makes sense. But I'm afraid that the `input_dim` setting would limit the feature encoding space. In the dynamic embedding context (compared to the original static embedding in current TensorFlow), `input_dim` should be `std::numeric_limits<uint64_t>::max`. I will try to explain it in a Google doc. Before that, you could refer to the TFRA API design, where only the `embedding_size` needs to be configured (similar to `out_dim`): https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py#L117
I think @divyashreepathihalli may have slightly misunderstood the meaning of dynamic shape embedding. For example, consider a training feature input that is both large-scale and sparse, such as USER_ID. If we apply the vocabulary method to USER_ID, it will only map USER_ID to the dimension of the vocabulary size, which compresses the information dimension. Since the vocabulary size is fixed, this is still a static embedding. Dynamic embedding means that all inputs can be processed without conflicts through a hashmap. The size of the dynamic embedding is not fixed and is unpredictable, because the number of USER_IDs grows as the business grows.
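To make the hashmap idea above concrete, here is a minimal sketch in plain Python. It is only an illustration: the `HashEmbedding` class, its `lookup` method, and the initializer range are hypothetical names and values chosen for this example, not the TFRA API or the RFC's design.

```python
import random


class HashEmbedding:
    """Toy hash-map embedding table that grows with the keys it sees.

    Hypothetical illustration only; not the TFRA or Keras API.
    """

    def __init__(self, embedding_size, seed=0):
        self.embedding_size = embedding_size
        self._table = {}  # key (any 64-bit integer) -> embedding vector
        self._rng = random.Random(seed)

    def lookup(self, key):
        # An unseen key gets its own fresh row, so two distinct keys
        # never share a row and no vocabulary size has to be preset.
        if key not in self._table:
            self._table[key] = [
                self._rng.uniform(-0.05, 0.05)
                for _ in range(self.embedding_size)
            ]
        return self._table[key]

    def __len__(self):
        return len(self._table)


table = HashEmbedding(embedding_size=2)
user_ids = [7, 2**64 - 1, 7, 123456789012345]
vectors = [table.lookup(uid) for uid in user_ids]
```

The table ends up with one row per distinct USER_ID, and the repeated ID `7` resolves to the same row both times. Only `embedding_size` is configured; the number of rows is whatever the data requires.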
Besides the example of `USER_ID` by @MoFHeka, in our recommender system we use user and item crossed features to enhance the accuracy and relevance of our recommendations. By combining multiple features into a unique identifier, we can create a more comprehensive representation of the relationship between users and items, resulting in better recommendations. When using `tf.sparse.cross_hashed` or `xxhash`, a sparse key in the range of `[0, std::numeric_limits<uint64_t>::max]` is generated. For such a large-scale and sparse feature, a dynamic size is mandatory.
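As a rough illustration of how a crossed feature ends up as a 64-bit key, the sketch below hashes a user ID and an item ID together. It uses `hashlib.blake2b` from the Python standard library purely as a stand-in; it does not reproduce the hash values of `xxhash` or of TensorFlow's crossing op, and `cross_key` is a hypothetical helper name.

```python
import hashlib

UINT64_MAX = 2**64 - 1


def cross_key(user_id, item_id):
    """Combine two feature values into one unsigned 64-bit key.

    Assumption: blake2b with an 8-byte digest stands in for the
    production hash function, which is not specified here.
    """
    # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{user_id}\x1f{item_id}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


key = cross_key("user_42", "item_9001")
```

The resulting key can land anywhere in `[0, UINT64_MAX]`, so it cannot be used as a row index into a dense table of preset size without a further modulo or vocabulary step, which is where collisions or evictions come in.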
@rhdong @MoFHeka thank you for the clarification. I tried to read up further. If I understand correctly you are looking for a dynamic vocabulary size and a dynamic embedding matrix as well, correct? One that would keep growing?
As of now, our scope of work will be limited to maintaining a fixed-size vocabulary and a fixed embedding size, and updating the vocabulary based on the inputs received by the layer and on eviction policies. The embedding values will be remapped whenever the vocabulary is updated based on input patterns (most frequently seen input, TTL, etc.). If an input key is not in the vocab, it will be mapped to a default value; however, we keep track of these keys and add them to the vocab when the updates are done in the callback (new keys are added to the vocab by kicking out old keys based on the specified eviction policies).
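The behavior described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the `FixedVocab` class, reserving index 0 as the default, and a simple "most frequent since the last update" eviction policy. The RFC's actual layer and callback are not shown here.

```python
from collections import Counter


class FixedVocab:
    """Toy fixed-capacity vocabulary with frequency-based eviction."""

    def __init__(self, capacity, default_index=0):
        self.capacity = capacity
        self.default_index = default_index  # reserved for unknown keys
        self.key_to_index = {}
        self.counts = Counter()  # keys seen since the last update

    def lookup(self, key):
        # Track every key, including ones not yet in the vocabulary.
        self.counts[key] += 1
        return self.key_to_index.get(key, self.default_index)

    def update(self):
        """Stand-in for the callback: keep the most frequent keys."""
        ranked = [k for k, _ in self.counts.most_common(self.capacity)]
        keep = set(ranked)
        # Evict keys that did not make the cut.
        for k in list(self.key_to_index):
            if k not in keep:
                del self.key_to_index[k]
        used = set(self.key_to_index.values())
        free = [i for i in range(1, self.capacity + 1) if i not in used]
        remapped = []
        for k in ranked:
            if k not in self.key_to_index:
                idx = free.pop(0)
                self.key_to_index[k] = idx
                remapped.append(idx)
        self.counts.clear()
        # Rows at these indexes now belong to new keys, so their
        # embedding values would need to be re-initialized.
        return remapped


vocab = FixedVocab(capacity=2)
first = [vocab.lookup(k) for k in [10, 10, 20, 30, 30, 30]]
remapped = vocab.update()
second = [vocab.lookup(k) for k in [10, 20, 30]]
```

Before the first update every key maps to the default index. After it, the two most frequent keys (`30` and `10`) own indexes 1 and 2, while `20` still falls back to the default. Note that the lookup table and the embedding rows are updated in separate steps here, which is the decoupling the consistency concern in this thread refers to.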
@divyashreepathihalli It's my pleasure. Considering the practical dynamic embedding scenarios we have encountered, a hashtable-based dynamic vocabulary size would be a fundamental requirement. I guess one of the pros of your current design is that there is no need to modify the `tf.optimizer`; that makes sense. But in addition to the considerations discussed above, I'm also a little worried that it will introduce data consistency issues caused by decoupling `embedding indexing` and `embedding lookup`, especially when eviction is involved. Applying atomic or lock mechanisms to IDs and embeddings is challenging when they are handled in two separate ops.
The image below illustrates the workflow when the parameter server
strategy is used. PSS supports asynchronous training. Each worker
will have a copy of the vocabulary, which will be consistent across
Hi @divyashreepathihalli, may I have your confirmation here? Does this mean each worker will hold a full copy of the vocabulary that maps `vocab` to `index`, while the real embedding vectors are stored on some PSs in a dense format (for example, `tf.Variable`)? Am I correct? Thank you so much!
That is correct. Each worker should have a copy of the vocabulary (vocab->index mapping). The embedding variable will be split across distributed servers.
Hi @divyashreepathihalli, thank you for your comment! If each worker holds a full copy of the `key-index` mapping, there must be some upper limit on the `vocabulary size`. To the best of my knowledge, the `vocabulary size` in some industrial scenarios can be tens or hundreds of billions, which makes memory consumption on GPU/TPU unbearably large. One practical solution is to store the `key-value` pairs in an abstract hashtable in a distributed way, as TFRA does. Hope it's helpful. Thanks!
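A back-of-the-envelope estimate shows why a full per-worker copy of the mapping becomes a problem at this scale. The sketch assumes 8 bytes per key and 8 bytes per index and ignores all hash table overhead, so real usage would be higher; `vocab_copy_bytes` is a hypothetical helper.

```python
def vocab_copy_bytes(num_keys, key_bytes=8, index_bytes=8):
    """Lower bound on memory for one key->index copy (no overhead)."""
    return num_keys * (key_bytes + index_bytes)


# Tens to hundreds of billions of keys, as mentioned in the thread.
for num_keys in (10**10, 10**11):
    gib = vocab_copy_bytes(num_keys) / 2**30
    print(f"{num_keys:.0e} keys -> {gib:,.0f} GiB per worker")
```

Under these assumptions, ten billion keys already need about 149 GiB per worker for the mapping alone, before any embedding vectors are stored.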
I agree with you. The proposed design would be the initial implementation and the distributed KV server would definitely be the way to go going forward.
Dynamic embedding is a very important feature for us. When training the sorting model that supports scenarios such as search, recommendation, and advertisement, we encountered the following problems:
With this feature, the main reasons:
In this design approach, the DynamicEmbedding layer is composed of two
layers: the DynamicLookup layer and the Embedding layer. The
DynamicLookup layer is responsible for the following tasks:
* Maintaining a vocabulary table using an eviction policy that is
We currently use parameters totaling about 1E13 bytes in production. Will it be very expensive to maintain the vocabulary and indexes for such large parameters?
updated based on input pattern.
* Performing vocabulary lookup for the given input and returning
integer indexes.
* The index is then passed to the Embedding layer, which looks
In many cases, we don't know exactly how many keys a feature has, since the properties of videos, images, commodities, video games, etc. are always changing. Presetting a vocab/index range may lead to wasted storage or feature conflicts.
updates corresponding to evolving input patterns and vocabulary changes.
### Goal
* Works across accelerators (GPU / TPU)
* Works with Parameter server strategy (asynchroous distributed training)
Typo: asynchronous