kNN search for data set sizes > 2G elements seems to go crazy :D
I was running kNN with k=1000 on a data set of 5,000,000,000 elements.
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.astrolabsoftware.spark3d.spatialOperator.SpatialQuery.KNN.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 556 in stage 0.0 failed 4
times, most recent failure: Lost task 556.3 in stage 0.0 (TID 967, 134.158.75.162, executor 3):
java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.util.TimSort.mergeHi(TimSort.java:899)
at java.util.TimSort.mergeAt(TimSort.java:516)
at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
at java.util.TimSort.sort(TimSort.java:254)
at java.util.Arrays.sort(Arrays.java:1512)
at com.google.common.collect.Ordering.leastOf(Ordering.java:708)
at com.astrolabsoftware.spark3d.utils.Utils$.com$astrolabsoftware$spark3d$utils$Utils$$takeOrdered(Utils.scala:174)
at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:154)
at com.astrolabsoftware.spark3d.utils.Utils$$anonfun$1.apply(Utils.scala:152)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Note that the same problem appears regardless of whether we want distinct objects or not.
My best guess is that we would need to trade integer for long: 5,000,000,000 does not fit in a 32-bit int (Int.MaxValue is 2,147,483,647).
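For what it's worth, here is a minimal, self-contained sketch (not the actual spark3d comparator) of how truncating a Long difference to Int breaks a comparator once values leave the 32-bit range. TimSort throws exactly this "Comparison method violates its general contract!" error when the ordering it is given is not transitive:

```scala
import java.util.Comparator

// Broken: the Long difference is truncated to Int. Once differences exceed
// 32 bits the sign of the result can flip, so the ordering is no longer
// transitive -- which is what TimSort's "general contract" check catches.
val broken: Comparator[Long] = (a: Long, b: Long) => (a - b).toInt

// Fixed: compare the Longs directly, with no truncation.
val fixed: Comparator[Long] = (a: Long, b: Long) => java.lang.Long.compare(a, b)

println(broken.compare(3000000000L, 0L))          // negative: claims 3e9 < 0
println(broken.compare(3000000000L, 1000000000L)) // positive: claims 3e9 > 1e9
println(broken.compare(1000000000L, 0L))          // positive: claims 1e9 > 0
// Together: 3e9 < 0 < 1e9 < 3e9 -- not a total order.
println(fixed.compare(3000000000L, 0L))           // positive: correct
```

If anything on the takeOrdered path (which feeds Guava's Ordering.leastOf and ultimately Arrays.sort) does Int arithmetic like this, 5,000,000,000 elements would be more than enough to trigger it.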
ADDED:
Interestingly though, this does not happen in the pure Scala version.
The difference that comes to mind is the default storage level for the RDD (NONE in Scala, MEMORY_ONLY in Python).
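One way to test that hypothesis would be to force the Scala side onto the same storage level the Python side reportedly uses and re-run the query. A sketch (with a stand-in RDD; the `SpatialQuery.KNN` call itself is omitted):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelCheck extends App {
  val spark = SparkSession.builder().appName("storage-level-check").getOrCreate()

  // Stand-in for the Point3D RDD that gets fed to SpatialQuery.KNN.
  val rdd = spark.sparkContext.parallelize(1L to 1000000L)

  println(rdd.getStorageLevel) // NONE: Scala RDDs are not persisted by default

  // Persist with the level pyspark3d reportedly applies, then re-run the
  // kNN query; if the failure now reproduces in Scala too, the storage
  // level (and not the Python binding itself) is the variable that matters.
  rdd.persist(StorageLevel.MEMORY_ONLY)
  println(rdd.getStorageLevel)

  spark.stop()
}
```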