Running spark-shell on Kubernetes

Timothy Zhang
3 min read · May 8, 2021

Spark’s documentation details how to run on a cluster managed by Kubernetes, but it mainly covers using spark-submit to submit Spark applications directly to a Kubernetes cluster in cluster mode. This article introduces a REPL environment: running spark-shell on a Kubernetes cluster in client mode.

The configuration shell script is given directly above. Running kubectl run creates a Spark client pod that keeps running, and kubectl exec is then used to start the simplest spark-shell environment (a sketch of these two steps follows the pod listing below). With our configuration there are 4 executor instances, so the pods look like this:

➜ kubectl get po
NAME                                  READY   STATUS    RESTARTS   AGE
spark-client                          1/1     Running   0          2m29s
spark-shell-b53ca77948f9db74-exec-1   1/1     Running   0          104s
spark-shell-b53ca77948f9db74-exec-2   1/1     Running   0          104s
spark-shell-b53ca77948f9db74-exec-3   1/1     Running   0          104s
spark-shell-b53ca77948f9db74-exec-4   1/1     Running   0          104s
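For reference, here is a minimal sketch of those two steps. The exact option set and wiring are assumptions based on the rest of this article ($CLN_NAME for the client pod name, plus $POD_IMAG and $CLS_ENDP as defined later); the full script shown earlier is the authoritative version, and the client pod also needs a service account that is allowed to create executor pods.

# Sketch only: start a long-running Spark client pod, then run spark-shell
# in client mode inside it. $CLN_NAME, $POD_IMAG and $CLS_ENDP are assumed
# to be exported as described in this article.
kubectl run $CLN_NAME --image=$POD_IMAG --restart=Never --command -- sleep infinity

kubectl exec -it $CLN_NAME -- sh -c '\
cd /opt/spark/; \
./bin/spark-shell \
--master k8s://"'$CLS_ENDP'" \
--conf spark.kubernetes.container.image="'$POD_IMAG'" \
--conf spark.kubernetes.driver.pod.name="'$CLN_NAME'" \
--conf spark.driver.host=$(hostname -i) \
--conf spark.executor.instances=4'
# spark.driver.host is expanded inside the pod ($(hostname -i) stays within
# the single quotes), so executors connect back to the client pod's IP.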

Customized Spark image

You can see that the spark-on-k8s-operator image is used directly in the environment-variable configuration:

export POD_IMAG=gcr.io/spark-operator/spark-operator:v1beta2-1.2.3-3.1.1

In fact, you can also use your own customized image, but you need to set the user to root. For example, after I downloaded the original Spark package:

➜ tar -zxf ~/Downloads/spark-3.1.1-bin-hadoop3.2.tgz
➜ cd spark-3.1.1-bin-hadoop3.2
➜ vi kubernetes/dockerfiles/spark/Dockerfile

Comment out the last line and add:

# Specify the User that the actual main process will run as
#USER ${spark_uid}
USER root

After saving, build the image and push it:

➜ docker build -t echang0929/dswspark-runner:v3.1.1 -f kubernetes/dockerfiles/spark/Dockerfile .
➜ docker push echang0929/dswspark-runner:v3.1.1

This way, you only need to point the earlier environment variable at this image:

export POD_IMAG=echang0929/dswspark-runner:v3.1.1

Then use the previous shell script to get the “cleanest” spark-shell REPL environment on Kubernetes.

Cluster Endpoint URL

The environment variable CLS_ENDP can be obtained by the following command:

➜ kubectl cluster-info
Kubernetes control plane is running at https://c2.us-south.containers.cloud.ibm.com:26165
...

You can copy and paste the URL from the output into the definition of the environment variable. But if you want to derive it with a command line in the definition, you need to strip the ANSI color escape sequences around the URL string:

tmp=$(kubectl cluster-info | grep "Kubernetes control plane")
export CLS_ENDP=$(echo "${tmp##* }" | sed $'s,\x1b\\[[0-9;]*[a-zA-Z],,g')

In addition, because I use IBM Public Cloud, I can also get the URL from the cluster name using the IBM Cloud CLI:

export CLS_CNXT=DSW-DIA-DataFlow-Testing
export CLS_ENDP=$(bx ks cluster ls --json | jq 'map(select( .name == "'$CLS_CNXT'"))[] | .serverURL' | xargs)

Spark Application Management

If you use the “:quit” command in the spark-shell REPL, you exit the environment normally. All executor pods are terminated, and although the Spark client pod keeps running, the driver process inside it has also been terminated.

If there is no activity in the REPL for a long time, Kubernetes drops the kubectl exec session and you fall back to the local shell environment. At that point all the executor pods, as well as the driver process in the Spark client, keep running, and we need to kill them. You can’t simply delete the executor pods, because they are treated as abnormally terminated executors and new ones are automatically created to replace them.
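For example, deleting one of the executor pods from the earlier listing only triggers a replacement (the pod name is copied from the listing above; the exact name of the new pod will differ):

# Deleting an executor pod by hand does not end the application: the lost
# executor is replaced, and a new exec pod with the next index soon appears.
kubectl delete pod spark-shell-b53ca77948f9db74-exec-1
kubectl get po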

Because the driver pod name is the client pod name, we can use Spark Application Management to handle this situation. The following command gets the status of the driver:

➜ kubectl exec -it $CLN_NAME -- sh -c '\
cd /opt/spark/; \
./bin/spark-submit \
--status "'$CLN_NAME'" \
--master k8s://"'$CLS_ENDP'"'

The following command terminates all executor pods and the driver process:

➜ kubectl exec -it $CLN_NAME -- sh -c '\
cd /opt/spark/; \
./bin/spark-submit \
--kill "'$CLN_NAME'" \
--master k8s://"'$CLS_ENDP'"'

At this point, even the Spark client pod is terminated. If you need to enter the spark-shell REPL environment again, re-run the “kubectl run $CLN_NAME” and “kubectl exec -it $CLN_NAME” commands from the previous shell script.
