On the partitioning: Garbage collector is huge while re-partitioning (2/3 of the total time) #73

JulienPeloton · 2018-07-18T13:45:39Z

OS: CentOS Linux release 7.4.1708 (Core)
spark3D: 0.1.4
spark-fits: 0.6.0

#72 adds a script to benchmark the partitioning. The idea is the following:

Load the data using spark-fits (10 million objects).
Apply the partitioning, or not, to the RDD.
Trigger an action and repeat it several times (the data is put in cache on the first pass).
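
To make the third step concrete, here is a minimal sketch of a repeat-and-time loop. It is not the script from #72: the names `BenchSketch`, `Sample`, `gcTimeMs` and `timeRepeated` are made up for illustration, and a plain closure stands in for the Spark action. It reads GC time from the JVM it runs in, so it only reflects executor GC when Spark runs in local mode; on a cluster the Spark UI numbers remain the reference.

```scala
import java.lang.management.ManagementFactory

object BenchSketch {
  // One measurement: wall-clock time and GC time, both in milliseconds.
  final case class Sample(wallMs: Long, gcMs: Long)

  // Cumulative GC time of the current JVM, summed over all collectors.
  // getCollectionTime returns -1 when the value is undefined, hence the max.
  def gcTimeMs(): Long = {
    var total = 0L
    val it = ManagementFactory.getGarbageCollectorMXBeans.iterator()
    while (it.hasNext) total += math.max(it.next().getCollectionTime, 0L)
    total
  }

  // Run `action` n times and record wall time and GC time for each run.
  def timeRepeated(n: Int)(action: => Unit): Seq[Sample] =
    (1 to n).map { _ =>
      val gc0 = gcTimeMs()
      val t0 = System.nanoTime()
      action
      val wallMs = (System.nanoTime() - t0) / 1000000L
      Sample(wallMs, gcTimeMs() - gc0)
    }
}
```

In the benchmark, the closure would be an action on the cached RDD, for instance `BenchSketch.timeRepeated(5) { rdd.count() }`, where the choice of `count()` is an assumption. The first sample then includes the cost of loading and caching, and the later ones isolate the re-partitioning.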

Regardless of the partitioning (octree or onion), GC time is a large fraction of the task duration (about 69% for octree and 61% for onion):

Octree (mapPartitions at Shape3DRDD.scala:164):

Metric	Min	25th percentile	Median	75th percentile	Max
Duration	48 s	48 s	48 s	48 s	48 s
GC Time	33 s	33 s	33 s	33 s	33 s

Onion (mapPartitions at Shape3DRDD.scala:164):

Metric	Min	25th percentile	Median	75th percentile	Max
Duration	46 s	46 s	46 s	46 s	46 s
GC Time	28 s	28 s	28 s	28 s	28 s

The code responsible for this is (Shape3DRDD.scala:142):

  /**
    * Repartition an RDD[T] according to a custom partitioner.
    *
    * @param rdd : (RDD[T])
    *   RDD of T (must extend Shape3D) with any partitioning.
    * @param partitioner : (SpatialPartitioner)
    *   Instance of SpatialPartitioner or any extension of it.
    * @return (RDD[T]) Repartitioned RDD[T].
    *
    */
  def partition(partitioner: SpatialPartitioner)(implicit c: ClassTag[T]) : RDD[T] = {
    // Go from RDD[V] to RDD[(K, V)] where K is specified by the partitioner.
    // Finally, return only RDD[V] with the new partitioning.

    def mapElements(iter: Iterator[T]) : Iterator[(Int, T)] = {
      var res = ListBuffer[(Int, T)]()
      while (iter.hasNext) {
        res ++= partitioner.placeObject(iter.next).toList
      }
      res.iterator
    }

    rawRDD.mapPartitions(mapElements).partitionBy(partitioner).mapPartitions(_.map(_._2), true)

  }

We must investigate this.
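
One candidate worth checking, as a hypothesis rather than a confirmed cause: `mapElements` copies every (key, object) pair of a partition into a `ListBuffer`, plus an intermediate `List` per element, before handing out an iterator. A lazy `flatMap` would emit the pairs one at a time instead. The sketch below compares both forms on plain iterators; `LazyPlacementSketch` and its toy `placeObject` (on `Int`, returning an `Iterator`) are hypothetical stand-ins, not the spark3D API.

```scala
import scala.collection.mutable.ListBuffer

object LazyPlacementSketch {
  // Hypothetical stand-in for partitioner.placeObject: one key per element.
  def placeObject(x: Int): Iterator[(Int, Int)] = Iterator((x % 4, x))

  // Same shape as mapElements in the issue: the whole partition is
  // materialized in memory before the first pair is returned.
  def mapElementsEager(iter: Iterator[Int]): Iterator[(Int, Int)] = {
    val res = ListBuffer[(Int, Int)]()
    while (iter.hasNext) {
      res ++= placeObject(iter.next()).toList
    }
    res.iterator
  }

  // Lazy variant: pairs are produced only when the consumer asks for them,
  // so no per-partition buffer is allocated.
  def mapElementsLazy(iter: Iterator[Int]): Iterator[(Int, Int)] =
    iter.flatMap(placeObject)
}
```

If `placeObject` in spark3D returns a type that `flatMap` accepts (an assumption to verify), the first stage could become `rawRDD.mapPartitions(_.flatMap(partitioner.placeObject))`. The shuffle triggered by `partitionBy` allocates memory as well, so only a new measurement can tell how much of the GC time this would remove.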

JulienPeloton added Partitioning performance labels Jul 18, 2018
