How to avoid gc overhead limit exceeded in a range

I am using Spark 2.4.3 with the GeoSpark 1.2.0 extension.

I have two tables to join with a range distance join. One table (t1) has ~100K rows and a single column, a GeoSpark geometry. The other table (t2) has ~30M rows and is composed of an Int value and a GeoSpark geometry column.

What I am trying to do is just a simple distance join:

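    import org.apache.spark.serializer.KryoSerializer
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
    import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

    val spark = SparkSession
      .builder()
    //  .master("local[*]")
      // Kryo serialization with GeoSpark's registrator for geometry types
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
      // GeoSpark index and spatial partitioning settings
      .config("geospark.global.index", "true")
      .config("geospark.global.indextype", "rtree")
      .config("geospark.join.gridtype", "rtree")
      .config("geospark.join.numpartition", 200)
      .config("spark.sql.parquet.filterPushdown", "true")
    //  .config("spark.sql.shuffle.partitions", 10000)
      // Disable automatic broadcast joins
      .config("spark.sql.autoBroadcastJoinThreshold", -1)
      .appName("PropertyMaster.foodDistanceEatout")
      .getOrCreate()

    // Register the ST_* SQL functions and GeoSpark's join strategies
    GeoSparkSQLRegistrator.registerAll(spark)

    spark.sparkContext.setLogLevel("ERROR")

    // dataPath is defined elsewhere
    // t2: ~30M rows (cid, geom)
    spark.read
      .load(s"$dataPath/t2")
      .repartition(200)
      .createOrReplaceTempView("t2")

    // t1: ~100K rows (geom only), cached
    spark.read
      .load(s"$dataPath/t1")
      .repartition(200)
      .cache()
      .createOrReplaceTempView("t1")

    // Distance join: keep every (t2, t1) pair within 3218.69 units of each other
    val query =
      """
        |select /*+ BROADCAST(t1) */
        |  t2.cid, ST_Distance(t1.geom, t2.geom) as distance
        |  from t2, t1 where ST_Distance(t1.geom, t2.geom) <= 3218.69""".stripMargin

    spark.sql(query)
      .repartition(200)
      .write.mode(SaveMode.Append)
      .option("path", s"$dataPath/my_output.csv")
      .format("csv").save()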

I tried different configurations, both when running it locally and on a local cluster on my laptop (16 GB of memory and 8 cores in total), but without any luck: the program crashes at GeoSpark's "Distinct at Join" stage with lots of shuffling. However, I am not able to remove the shuffling using SparkSQL syntax. I thought of adding an extra id column to the bigger table (for example, the same integer for every 200 rows or so) and repartitioning by it, but that didn't work either.

I was expecting GeoSpark's indexing to provide a partitioner, but I am not sure it is working.

Any ideas?

Tags: scala apache-spark apache-spark-sql geospark

1 Answer

叛逆

Post #2 · 2019-08-21 21:40

I found an answer myself. The GC overhead error was caused partly by the partitioning, but also by the memory that GeoSpark's index-based Partitioner needs. On top of that, there were timeouts caused by the long-running geo query computations. Both problems were solved by adding the following parameters, as suggested on the GeoSpark website itself:

    spark.executor.memory 4g
    spark.driver.memory 10g
    spark.network.timeout 10000s
    spark.driver.maxResultSize 5g
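
For anyone wondering where these go: the lines above are in the `spark-defaults.conf` format. As a rough sketch, three of them can also be set on the session builder, assuming no SparkSession has been created yet in the same JVM. The values are simply the ones quoted above, not tuned recommendations, and the other builder options from the question are left out for brevity.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: values are the ones quoted in this answer; adjust to your machine.
val spark = SparkSession
  .builder()
  .appName("PropertyMaster.foodDistanceEatout")
  // Heap size requested for each executor when it is launched
  .config("spark.executor.memory", "4g")
  // Generous timeout so long geometry computations are not treated as lost executors
  .config("spark.network.timeout", "10000s")
  // Upper bound on the total size of serialized results returned to the driver
  .config("spark.driver.maxResultSize", "5g")
  .getOrCreate()

// spark.driver.memory is deliberately not set here: in client mode the driver
// JVM is already running when this code executes, so it has to be supplied at
// launch time, e.g. `spark-submit --driver-memory 10g ...` or spark-defaults.conf.
```

Note that with `local[*]` the executors run inside the driver JVM, so in that mode the driver memory setting is the one that matters most.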
