Spark on Mesos (DC/OS) loses tasks before doing an

2019-06-03 02:28发布

I'm experiencing serious problem with DC/OS Apache Spark running in Mesos cluster mode. I'm using GCE instances for DC/OS for testing. Everything was working fine with spark, Spark just suddenly stopped working properly.

Before it starts executors it says that some executors are lost and it blacklists nodes leaving very little resources so it runs ~2-3 executors. Then it proceeds to remove non-existing executors but I don't know if it succeeds.

I wanted to disable blacklisting but it's hardcoded in Spark mesos files and impossible to disable or change any configuration.

This happens mainly when limiting cores per executor, making Spark driver create few executors on one node. Without this setting Spark driver creates one executor per node, driver shows that they are lost however nodes don't get blacklisted because it tried to run one executor per node and blacklist limit is 2 executor fails.

Spark dynamic allocation however works completely fine with no errors and no lost tasks.

log of the situation:

17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@131fcb6f{/SQL,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@878537d{/SQL/json,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6c37bd27{/SQL/execution,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30331109{/SQL/execution/json,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4c27d39d{/static/sql,null,AVAILABLE,@Spark}
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S0 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S3 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 0 requested
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
17/10/04 06:13:53 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave d70a5cac-18e6-4365-99cd-cf5cad357eb5-S3 due to too many failures; is Spark installed on it?
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 3 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 1 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 5 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 4 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
17/10/04 06:13:53 INFO BlockManagerMaster: Removal of executor 2 requested
17/10/04 06:13:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/10/04 06:13:54 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/10/04 06:13:55 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/10/04 06:13:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 6 is now TASK_RUNNING
2017-10-04 06:13:55,847:7(0x7f56dcb82700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 32ms
17/10/04 06:13:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 7 is now TASK_RUNNING

And because two almost empty nodes got blacklisted it runs less executors using what is left in cluster.

Unfortunately I can't give DEBUG logs or anything like that because it seems that it's impossible to use custom log4j file(I followed what people wrote to do and nothing)

Entire log from the beginning(without fetching files) do I miss something here?:

I1004 09:19:27.268869  5311 exec.cpp:162] Version: 1.4.0
I1004 09:19:27.273120  5317 exec.cpp:237] Executor registered on agent e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S2
I1004 09:19:27.273859  5315 executor.cpp:120] Registered docker executor on 10.132.0.12
I1004 09:19:27.273959  5317 executor.cpp:160] Starting task driver-20171004091925-0004
Using bootstrap to set SPARK_LOCAL_IP
2017/10/04 09:19:27 Printing new task IP: 10.132.0.12
SPARK_LOCAL_IP = 10.132.0.12
spark-env: User: root
spark-env: No kerberos KDC config found
17/10/04 09:19:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/04 09:19:30 INFO SparkContext: Running Spark version 2.2.0
17/10/04 09:19:30 INFO SparkContext: Submitted application: UserEventsReader$
17/10/04 09:19:30 INFO SecurityManager: Changing view acls to: root
17/10/04 09:19:30 INFO SecurityManager: Changing modify acls to: root
17/10/04 09:19:30 INFO SecurityManager: Changing view acls groups to: 
17/10/04 09:19:30 INFO SecurityManager: Changing modify acls groups to: 
17/10/04 09:19:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
17/10/04 09:19:30 INFO Utils: Successfully started service 'sparkDriver' on port 45577.
17/10/04 09:19:30 INFO SparkEnv: Registering MapOutputTracker
17/10/04 09:19:31 INFO SparkEnv: Registering BlockManagerMaster
17/10/04 09:19:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/10/04 09:19:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/10/04 09:19:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a1289f49-471f-4b35-9f23-7b3ba2ad1b90
17/10/04 09:19:31 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/10/04 09:19:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/10/04 09:19:31 INFO log: Logging initialized @3693ms
17/10/04 09:19:31 INFO Server: jetty-9.3.z-SNAPSHOT
17/10/04 09:19:31 INFO Server: Started @3881ms
17/10/04 09:19:31 INFO AbstractConnector: Started ServerConnector@5707c1cb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
17/10/04 09:19:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@fac80{/jobs,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@48e64352{/jobs/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4362d7df{/jobs/job,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@200606de{/jobs/job/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@f8908f6{/stages,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2ef8a8c3{/stages/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@63fd4873{/stages/stage,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7957dc72{/stages/stage/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3aacf32a{/stages/pool,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@82c57b3{/stages/pool/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@600b0b7{/storage,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5ea502e0{/storage/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@473b3b7a{/storage/rdd,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@77b7ffa4{/storage/rdd/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@402f80f5{/environment,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@133e019b{/environment/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7dac3fd8{/executors,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2102a4d5{/executors/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d4d3fe7{/executors/threadDump,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@51684e4a{/executors/threadDump/json,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@38875e7d{/static,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a6f5124{/,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@ec2bf82{/api,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@669253b7{/jobs/job/kill,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@51a06cbe{/stages/stage/kill,null,AVAILABLE,@Spark}
17/10/04 09:19:31 INFO SparkUI: Bound SparkUI to 10.132.0.12, and started at http://10.132.0.12:4040
17/10/04 09:19:31 INFO SparkContext: Added JAR file:/mnt/mesos/sandbox/SparkWindow-1.0.jar at spark://10.132.0.12:45577/jars/SparkWindow-1.0.jar with timestamp 1507108771967
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@730: Client environment:host.name=private-2
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-514.26.2.el7.x86_64
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Jul 4 15:04:05 UTC 2017
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@log_env@767: Client environment:user.dir=/mnt/mesos/sandbox
2017-10-04 09:19:32,651:7(0x7f9058164700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=master.mesos:2181 sessionTimeout=10000 watcher=0x7f906202f4d0 sessionId=0 sessionPasswd=<null> context=0x7f909400bed0 flags=0
2017-10-04 09:19:32,658:7(0x7f90539de700):ZOO_INFO@check_events@1728: initiated connection to server [10.132.0.10:2181]
I1004 09:19:32.658676    93 sched.cpp:232] Version: 1.4.0
2017-10-04 09:19:32,663:7(0x7f90539de700):ZOO_INFO@check_events@1775: session establishment complete on server [10.132.0.10:2181], sessionId=0x15ee68b946c0063, negotiated timeout=10000
I1004 09:19:32.664188    87 group.cpp:341] Group process (zookeeper-group(1)@10.132.0.12:34557) connected to ZooKeeper
I1004 09:19:32.664244    87 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1004 09:19:32.664261    87 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
I1004 09:19:32.665998    87 detector.cpp:152] Detected a new leader: (id='32')
I1004 09:19:32.666154    87 group.cpp:700] Trying to get '/mesos/json.info_0000000032' in ZooKeeper
I1004 09:19:32.670025    88 zookeeper.cpp:262] A new leading master (UPID=master@10.132.0.10:5050) is detected
I1004 09:19:32.670140    88 sched.cpp:336] New master detected at master@10.132.0.10:5050
I1004 09:19:32.670532    88 sched.cpp:352] No credentials provided. Attempting to register without authentication
I1004 09:19:32.676714    85 sched.cpp:759] Framework registered with ed329c1d-e021-4b11-a9c3-c9bf02c26921-0000-driver-20171004091925-0004
17/10/04 09:19:32 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40838.
17/10/04 09:19:32 INFO NettyBlockTransferService: Server created on 10.132.0.12:40838
17/10/04 09:19:32 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/10/04 09:19:32 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.132.0.12:40838 with 912.3 MB RAM, BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:32 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.132.0.12, 40838, None)
17/10/04 09:19:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@66bfd864{/metrics/json,null,AVAILABLE,@Spark}
17/10/04 09:19:33 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/10/04 09:19:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/mnt/mesos/sandbox/spark-warehouse').
17/10/04 09:19:34 INFO SharedState: Warehouse path is 'file:/mnt/mesos/sandbox/spark-warehouse'.
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@878537d{/SQL,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@29fc1a2b{/SQL/json,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30331109{/SQL/execution,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@74fe5966{/SQL/execution/json,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7bde1f3a{/static/sql,null,AVAILABLE,@Spark}
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S2 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 6 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S0 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave e493efb0-fffc-4ddc-a697-3c0a0108f8e5-S0 due to too many failures; is Spark installed on it?
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 0 requested
17/10/04 09:19:34 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 4 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 1 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 6 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 6
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 5 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 2 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
17/10/04 09:19:34 INFO BlockManagerMaster: Removal of executor 3 requested
17/10/04 09:19:34 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/10/04 09:19:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
2017-10-04 09:19:36,041:7(0x7f90539de700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 43ms
17/10/04 09:19:36 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 7 is now TASK_RUNNING
17/10/04 09:19:36 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/10/04 09:19:36 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 8 is now TASK_RUNNING
17/10/04 09:19:36 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it

0条回答
登录 后发表回答