Apache Spark: SIGTERM mystery with dynamic allocation

Managing jobs in Apache Spark is a challenge, and it becomes an even bigger one when your executors get stopped without any meaningful error message.

I was running a cluster with eight nodes and a computation job that was supposed to run for around 10 hours. However, after just six hours the application got stuck: eight active workers, but no tasks running. It seemed like the application had simply become lazy and didn't want to do any work anymore. Here are the steps I followed to debug the issue:

Step 1. The cause

The only message related to the issue was the one I found in the Spark worker log:

ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

After some digging I found that the problem was most likely caused by the executor allocating too much memory and being terminated by something. Everything I found on Google about SIGTERM issues in Spark concerned the YARN resource manager terminating executors. Since I was running Spark in standalone mode, my guess was that it must have been either the standalone resource manager or the operating system doing this for me.
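One way to check the operating-system suspect is the kernel log: on Linux, memory-pressure kills are recorded by the OOM killer (which, notably, sends SIGKILL rather than SIGTERM, so a hit here would point away from the resource manager). A debugging sketch, assuming standard Linux tooling, not something from the original investigation:

```shell
# If the kernel's OOM killer terminated the executor JVM, dmesg records it.
# Run on each worker node; may require root on some distributions.
dmesg | grep -iE 'out of memory|killed process' || echo "no OOM events in kernel log"
```

An empty result here suggests the SIGTERM came from within the Spark deployment itself rather than from the OS.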

What was more surprising was that Spark was not trying to allocate new executors after the existing ones failed. According to the Spark documentation, dynamic resource allocation is supposed to do exactly that, but in this case it was not.

Step 2. The executor story

After further investigation it turned out that dynamic allocation was indeed trying to allocate a new executor, but each new executor was terminated shortly after being created. The following was happening:

  1. An executor which had been executing tasks for several hours received SIGTERM and ended.
  2. A new executor was created, but received SIGTERM.
  3. A new executor was created, but received SIGTERM.

Step 3. The master mystery

Reading the Master logs, I noticed rather erratic behavior of the application with respect to executors:

17/04/18 22:11:44 INFO Master: Application app-20170418163116-0001 requested to set total executors to 7.
17/04/18 22:11:47 INFO Master: Application app-20170418163116-0001 requested to set total executors to 6.
17/04/18 22:11:49 INFO Master: Application app-20170418163116-0001 requested to set total executors to 5.
17/04/18 22:11:51 INFO Master: Application app-20170418163116-0001 requested to set total executors to 4.
17/04/18 22:11:53 INFO Master: Application app-20170418163116-0001 requested to set total executors to 3.
17/04/18 22:11:55 INFO Master: Application app-20170418163116-0001 requested to set total executors to 2.
17/04/18 22:11:57 INFO Master: Application app-20170418163116-0001 requested to set total executors to 1.
17/04/18 22:12:09 INFO Master: Application app-20170418163116-0001 requested to set total executors to 0.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 9.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 10.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 12.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 16.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 24.
17/04/18 22:12:16 INFO Master: Application app-20170418163116-0001 requested to set total executors to 25.
17/04/18 22:12:30 INFO Master: Application app-20170418163116-0001 requested to set total executors to 24.

And then again from 24 down to 0. The second time, however, the application stayed at 0 executors, not doing any work at all.

The mystery is: Why would an application with a full queue of tasks give up all the executors?
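One plausible reading of the ramp-down, inferred from the Spark configuration docs rather than stated anywhere in these logs: after the executor deaths, the remaining executors sat idle, and dynamic allocation released each one once it exceeded its idle timeout, all the way down to the default floor of zero. The relevant settings and their defaults:

```
# spark-defaults.conf fragment (defaults shown; illustrative, not this job's actual config)
spark.dynamicAllocation.executorIdleTimeout      60s  # idle executors are released after this
spark.dynamicAllocation.schedulerBacklogTimeout  1s   # pending tasks trigger requests for new executors
spark.dynamicAllocation.minExecutors             0    # default floor -- allows scaling down to zero
```

With a floor of zero, nothing stops the application from shedding its last executor.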

Step 4. The solution

The solution was to make the application never give up all of its executors. Fortunately this can be done easily by setting spark.dynamicAllocation.minExecutors to a reasonable value (16 in this case). It prevents the application from terminating all the executors and going to sleep thinking there is no more work to be done.
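Concretely, the fix can be applied at submit time. A sketch, in which the jar name and the maxExecutors value are illustrative; only the minExecutors setting is the actual fix:

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=16 \
  --conf spark.dynamicAllocation.maxExecutors=25 \
  my-job.jar
```

The same properties can also go into spark-defaults.conf or be set on the SparkConf before the context is created. Note that dynamic allocation additionally requires the external shuffle service to be enabled on each worker, hence the second --conf line.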

 
