Hi,
I am using MinHasher32 to create clusters of similar records, tokenizing the record values to create the signatures (as I explained in #609), but it seems that the resulting buckets depend on the Spark configuration.
I ran the same code several times on a single node of a cluster with 16 cores and always obtained the same number of buckets, X. Then, on the same machine, I increased the number of cores to 20 and the number of buckets changed to a different number, Y. I repeated the test and obtained Y again.
Is it possible that the execution of the MinHasher is influenced by the number of nodes? Can someone explain why?
Thanks
Regards
Luca
I can confirm that the bucket generation depends on the level of Spark parallelism.
I ran a test on my laptop, repartitioning the tokens before initializing the MinHasher:
val attributeWithHashes: RDD[(String, Iterable[MinHashSignature])] =
  attributesToken
    .repartition(10)
    .map { case (attribute, token) =>
      (attribute, minHasher.init(token))
    }
    .groupByKey()
With the same number of partitions I always obtain the same buckets; if I change the number of partitions, I obtain different buckets.
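For comparison, here is a minimal sketch of the behaviour I would expect in principle: a min-hash signature built with fixed-seed hash functions and merged by element-wise minimum should not depend on how the tokens are split across partitions. This is a toy MinHash written in plain Scala, not Algebird's implementation; `ToyMinHash`, `numHashes`, the seed 42 and the `chunks` parameter (which stands in for the partition count) are all assumptions made for illustration. It requires a non-empty token list.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Toy MinHash for illustration only; this is NOT Algebird's MinHasher32.
object ToyMinHash {
  val numHashes = 16

  // Fixed seed, so every run derives the same family of hash functions.
  val seeds: Array[Int] = {
    val rng = new Random(42L)
    Array.fill(numHashes)(rng.nextInt())
  }

  // Signature of a single token: one hash value per seeded hash function.
  def init(token: String): Array[Int] =
    seeds.map(seed => MurmurHash3.stringHash(token, seed))

  // Merging two signatures keeps the element-wise minimum.
  def plus(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => math.min(x, y) }

  // Signature of a non-empty token list, computed over `chunks` simulated partitions:
  // each chunk is reduced on its own, then the partial signatures are merged.
  def signature(tokens: Seq[String], chunks: Int): Seq[Int] = {
    val size = math.max(1, math.ceil(tokens.size.toDouble / chunks).toInt)
    tokens
      .grouped(size)
      .map(chunk => chunk.map(init).reduce(plus))
      .reduce(plus)
      .toSeq
  }
}
```

Because the minimum is commutative and associative, `ToyMinHash.signature(tokens, 1)` and `ToyMinHash.signature(tokens, 3)` should be equal for the same tokens. If the real job behaves differently, it may be worth checking whether the tokens reaching `minHasher.init` and the way the grouped signatures are later combined are identical across partition counts.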